MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

1Yokohama National University, 2Zhejiang University, 3University of Chinese Academy of Sciences, 4Chengdu Institute of Computer Applications, Chinese Academy of Sciences, 5Southeast University, 6University of Tsukuba, 7Pusan National University, 8Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Abstract

The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained due to the exclusive availability of MRE mix datasets in Japanese, thereby limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese.

In this paper, we also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by leveraging LLMs to translate the original Japanese datasets.
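To illustrate what such LLM-assisted translation can look like in practice, below is a minimal sketch, assuming an OpenAI-style chat-completion API. The model name, prompt wording, and JSON record format are placeholders for illustration and not the authors' exact pipeline.

    # Minimal sketch of LLM-assisted dataset translation (not the authors' exact pipeline).
    # Assumption: an OpenAI-style chat API and Japanese text/label records in JSON.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = (
        "Translate the following Japanese sentence and its entity labels into {lang}. "
        "Keep the label spans aligned with the translated sentence and return JSON "
        "with keys 'text' and 'entities'.\n\n{record}"
    )

    def translate_record(record_json: str, target_lang: str = "English") -> str:
        """Translate one annotated example while asking the LLM to preserve label alignment."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder; any capable instruction-following LLM works
            messages=[{"role": "user", "content": PROMPT.format(lang=target_lang, record=record_json)}],
            temperature=0,  # deterministic output makes manual spot-checking easier
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        # Example record: "Taro Yamada was born in Tokyo." with person and location labels.
        example = ('{"text": "山田太郎は東京で生まれた。", '
                   '"entities": [{"span": "山田太郎", "label": "人名"}, {"span": "東京", "label": "地名"}]}')
        print(translate_record(example, "English"))

Translated outputs of this kind still require manual verification, but the LLM pass removes most of the annotation effort of building each language's sub-dataset from scratch.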

Additionally, we have enriched the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Utilizing this expanded dataset, we developed a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). The OIELLM model effectively processes the novel MMM datasets and demonstrates significant improvements in performance.

Mutual Reinforcement Effect

The Mutual Reinforcement Effect between word-level labels and the text-level label within the same text.

Mutual Reinforcement Effect Image

OIELLM

The input and output of Open-domain Information Extraction Large Language Model (OIELLM).

OIELLM Figure
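For readers unfamiliar with this setup, the following is a minimal sketch of how a single input-output pair can combine a text-level label with word-level labels. The field names, separators, and example sentence are assumptions for illustration only, not OIELLM's exact template.

    # Illustrative sketch of a unified input-output pair for MRE-style multitask IE.
    # The separators and example below are assumptions, not OIELLM's exact format.
    def build_example(task: str, text: str, text_label: str,
                      word_labels: list[tuple[str, str]]) -> dict:
        """Pack a text-level label and word-level labels into one input/output pair."""
        source = f"{task}: {text}"
        target = f"{text_label}; " + "; ".join(f"{span}: {label}" for span, label in word_labels)
        return {"input": source, "output": target}

    example = build_example(
        task="sentence classification and named entity recognition",
        text="Apple released the new iPhone in California last week.",
        text_label="neutral",
        word_labels=[("Apple", "organization"), ("iPhone", "product"), ("California", "location")],
    )
    print(example["input"])   # task instruction plus input text
    print(example["output"])  # text-level label followed by word-level labels

Because both label granularities appear in one target sequence, the model can exploit the Mutual Reinforcement Effect: the text-level label constrains which word-level labels are plausible, and vice versa.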

MMM Datasets

Names of all sub-datasets in the Multilingual Mutual Reinforcement Effect Mix Datasets. (The areas in the image do not reflect the actual proportions of the sub-dataset sizes.)

MMM Figure



Related Links

We conclude by thanking the contributors of the source datasets on which MMM is built, as well as the pioneering researchers who selflessly shared their work.

1. Japanese Wikipedia NER dataset - Takahiro Omi - https://github.com/stockmarkteam/ner-wikipedia-dataset

2. JGLUE: Japanese General Language Understanding Evaluation - Kentaro Kurihara, Daisuke Kawahara, Tomohide Shibata - https://github.com/yahoojapan/JGLUE

3. livedoor news corpus - Koji Sekiguchi - https://www.rondhuit.com/download.html

4. UniversalNER - Wenxuan Zhou et al. - https://arxiv.org/abs/2308.03279

Below is some previous work in the MRE series; these papers provide further background on MRE.

1. Mutual Reinforcement Effects in Japanese Sentence Classification and Named Entity Recognition Tasks (2023) Chengguang Gan, Qinghao Zhang, Tatsunori Mori

2. USA: Universal Sentiment Analysis Model & Construction of Japanese Sentiment Text Classification and Part of Speech Dataset (2023) Chengguang Gan, Qinghao Zhang, Tatsunori Mori

3. GIELLM: Japanese General Information Extraction Large Language Model Utilizing Mutual Reinforcement Effect (2023) Chengguang Gan, Qinghao Zhang, Tatsunori Mori

BibTeX

@misc{gan2024mmmmultilingualmutualreinforcement,
      title={MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models},
      author={Chengguang Gan and Qingyu Yin and Xinyang He and Hanjun Wei and Yunhao Liang and Younghun Lim and Shijian Wang and Hexiang Huang and Qinghao Zhang and Shiwen Ni and Tatsunori Mori},
      year={2024},
      eprint={2407.10953},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10953},
}