Analysis of Czech web 1T 5-gram corpus and its comparison with Czech national corpus data

Václav Procházka; Petr Pollák

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2010) 6231 LNAI 181-188

DOI: 10.1007/978-3-642-15760-8_24

Abstract

In this paper, the newly issued Czech Web 1T 5-gram corpus created by Google and LDC is analysed and compared with a reference n-gram corpus obtained from the Czech National Corpus. The original 5-grams from both corpora were post-processed, and statistical trigram language models of various vocabulary sizes and parameters were created. A comparison of corpus statistics, such as unique and total word and n-gram counts before and after post-processing, is presented and discussed, with particular focus on cleaning invalid tokens from the Web 1T data. Tools from the HTK Toolkit were used for the evaluation, and accuracy, OOV rates and perplexity were measured using sentence transcriptions from the Czech SPEECON database. © 2010 Springer-Verlag Berlin Heidelberg.
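As a rough illustration of two of the measures named in the abstract, the following minimal Python sketch computes an out-of-vocabulary (OOV) rate and a perplexity value. It is not the authors' procedure, which relied on HTK tools. The function names, the token-validity rule (letters only), and the toy data are assumptions made for illustration, since the abstract does not state the actual cleaning criteria.

```python
import re

# Hypothetical validity rule (an assumption, not taken from the paper):
# keep a token only if it consists entirely of letters, diacritics included.
VALID_TOKEN = re.compile(r"^[^\W\d_]+$")


def is_valid_token(token):
    """Return True if the token passes the illustrative letters-only rule."""
    return bool(VALID_TOKEN.match(token))


def oov_rate(sentences, vocabulary):
    """Fraction of running words in `sentences` that are missing from `vocabulary`."""
    total = 0
    oov = 0
    for sentence in sentences:
        for word in sentence.split():
            total += 1
            if word not in vocabulary:
                oov += 1
    return oov / total if total else 0.0


def perplexity(log10_probs):
    """Perplexity computed from per-word base-10 log probabilities."""
    if not log10_probs:
        raise ValueError("at least one word probability is required")
    return 10 ** (-sum(log10_probs) / len(log10_probs))


# Toy data, invented for this example only.
sentences = ["dobrý den", "jak se máš"]
vocabulary = {"dobrý", "den", "jak", "se"}
rate = oov_rate(sentences, vocabulary)
```

Here `rate` is the share of test words that the vocabulary does not cover. In a setup like the one described, the vocabulary would come from the cleaned n-gram data and the sentences from the test transcriptions.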

Cite

Procházka, V., & Pollák, P. (2010). Analysis of Czech web 1T 5-gram corpus and its comparison with Czech national corpus data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6231 LNAI, pp. 181–188). https://doi.org/10.1007/978-3-642-15760-8_24
