CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

5 citations · 57 readers (Mendeley)

Abstract

We propose a novel open-domain question answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource and fine-tuned settings across multiple tasks, models and benchmarks.

Cite

CITATION STYLE

APA

Huber, P., Aghajanyan, A., Oguz, B., Okhonko, D., Yih, W. T., Gupta, S., & Chen, X. (2022). CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. In Findings of the Association for Computational Linguistics: NAACL 2022 - Findings (pp. 2402–2420). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-naacl.184
