Google books ngram: Problems of representativeness and data reliability

Valery D. Solovyev; Vladimir V. Bochkarev; Svetlana S. Akhtyamova

Conference Proceedings

Communications in Computer and Information Science (2020) 1223 CCIS 147-162

DOI: 10.1007/978-3-030-51913-1_10

Abstract

The article discusses the representativeness of Google Books Ngram (GBN) as a multi-purpose corpus and analyses published criticism of it. A comparative study of GBN data against data from the Russian National Corpus and the General Internet Corpus of Russian shows that the Google Books Ngram corpus can be used successfully for corpus-based studies. A new concept, the “diachronically balanced corpus”, is introduced. The article also describes the word-spelling and metadata errors present in the GBN corpus and proposes possible ways of improving the quality of GBN data.

Citation (APA)

Solovyev, V. D., Bochkarev, V. V., & Akhtyamova, S. S. (2020). Google books ngram: Problems of representativeness and data reliability. In Communications in Computer and Information Science (Vol. 1223 CCIS, pp. 147–162). Springer. https://doi.org/10.1007/978-3-030-51913-1_10
