In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1- to 5-gram corpus, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.
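The core step of building such a corpus, counting every contiguous 1- to 5-gram in the tokenized article text, can be sketched as follows. This is a minimal illustration of n-gram counting in general, not the authors' actual pipeline; the function name and the toy input are our own.

```python
from collections import Counter

def extract_ngrams(tokens, n_max=5):
    """Count all contiguous 1- to n_max-grams in a token sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
counts = extract_ngrams(tokens)
print(counts[("the",)])          # → 2 (the unigram "the" occurs twice)
print(counts[("the", "quick")])  # → 1
```

A real pipeline over full Wikipedia dumps would stream the XML, strip wiki markup, and merge per-article counts on disk rather than holding one in-memory Counter.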
CITATION STYLE
Cacho, J. R. F., Cisneros, B., & Taghva, K. (2021). Building a Wikipedia N-GRAM Corpus. In Advances in Intelligent Systems and Computing (Vol. 1251 AISC, pp. 277–294). Springer. https://doi.org/10.1007/978-3-030-55187-2_23