In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1- to 5-gram corpus, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.
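The core step of building such a corpus, counting every contiguous 1- to 5-gram in the tokenized article text, can be sketched as follows. This is a minimal illustration of n-gram counting in general, not the authors' actual pipeline; the function name and the toy input are our own.

```python
from collections import Counter

def extract_ngrams(tokens, n_max=5):
    """Count all contiguous 1- to n_max-grams in a token sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
counts = extract_ngrams(tokens)
print(counts[("the",)])          # → 2 (the unigram "the" occurs twice)
print(counts[("the", "quick")])  # → 1
```

A real pipeline over full Wikipedia dumps would stream the XML, strip wiki markup, and merge per-article counts on disk rather than holding one in-memory Counter.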
CITATION STYLE
Cacho, J. R. F., Cisneros, B., & Taghva, K. (2021). Building a Wikipedia N-GRAM Corpus. In Advances in Intelligent Systems and Computing (Vol. 1251 AISC, pp. 277–294). Springer. https://doi.org/10.1007/978-3-030-55187-2_23