Abstract
We propose a novel open-domain question-answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource and fine-tuned settings across multiple tasks, models and benchmarks.
Citation
Huber, P., Aghajanyan, A., Oguz, B., Okhonko, D., Yih, W. T., Gupta, S., & Chen, X. (2022). CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. In Findings of the Association for Computational Linguistics: NAACL 2022 - Findings (pp. 2402–2420). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-naacl.184