Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese, ExamQA. We next use incomplete yet relevant snippets returned by a web search engine as the context for each QA instance to convert it into a weakly-labeled MRC instance. To better use the weakly-labeled data to improve a target MRC task, we evaluate and compare several methods and further propose a self-teaching paradigm. Experimental results show that, over state-of-the-art MRC baselines, we obtain +5.1% in accuracy on a multiple-choice Chinese MRC dataset, C3, and +3.8% in exact match on an extractive Chinese MRC dataset, CMRC 2018, demonstrating the usefulness of the generated QA-based weakly-labeled data for different types of MRC tasks as well as the effectiveness of self-teaching. ExamQA will be available at https://dataset.org/examqa/.
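The core data-conversion step described above can be illustrated with a minimal sketch (not the authors' code; the function name, field names, and example instance are illustrative assumptions): a multiple-choice QA instance is paired with a noisy snippet returned by a web search engine, and the original QA answer is reused as a weak label for the resulting MRC instance.

```python
# Hypothetical sketch of QA-to-weakly-labeled-MRC conversion, as described
# in the abstract. Function and field names are assumptions, not the
# authors' actual implementation.

def to_weak_mrc(question, choices, answer_idx, snippet):
    """Attach a retrieved (possibly incomplete) web snippet as context to a
    multiple-choice QA instance, yielding a weakly-labeled MRC instance."""
    return {
        "context": snippet,      # noisy, search-engine-returned text
        "question": question,
        "choices": choices,
        "label": answer_idx,     # original QA answer, reused as a weak label
    }

# Illustrative instance (invented for this sketch):
instance = to_weak_mrc(
    question="Which organ pumps blood through the body?",
    choices=["Liver", "Heart", "Lung", "Kidney"],
    answer_idx=1,
    snippet="The heart is a muscular organ that pumps blood through the circulatory system...",
)
```

The label is "weak" because the snippet is not guaranteed to contain the evidence needed to answer the question, which is why the paper then studies how best to exploit such data, including the proposed self-teaching paradigm.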
CITATION STYLE
Yu, D., Sun, K., Yu, D., & Cardie, C. (2021). Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data. In Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 (pp. 56–68). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-emnlp.6