This paper presents statistical characteristics of the Chinese language derived from huge text corpora. From our investigation, we find that written Chinese tends to use longer words, while other language styles favor shorter words. In large text corpora, the numbers of distinct bigrams and trigrams can be estimated from the size of the corpus. In recognition experiments, we find only a weak correlation between perplexity and either the size of the training set or the recognition character error rate. However, to attain good performance, a large training set of more than tens of millions of words is necessary.
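As a minimal illustration (not from the paper) of how the number of distinct bigrams and trigrams grows with corpus size, the following sketch counts unique n-grams over increasing prefixes of a toy token sequence; real estimates would of course use corpora of the scale the authors describe.

```python
def distinct_ngrams(tokens, n):
    """Count the distinct n-grams in a token sequence."""
    return len(set(zip(*(tokens[i:] for i in range(n)))))

# Hypothetical toy corpus; shows distinct-n-gram growth with corpus size.
corpus = "the cat sat on the mat the cat ran".split()
for size in (4, 6, len(corpus)):
    prefix = corpus[:size]
    print(size, distinct_ngrams(prefix, 2), distinct_ngrams(prefix, 3))
```

On larger corpora the same counts would typically be gathered with streaming hash tables rather than an in-memory set.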
CITATION STYLE
Zhang, H., Xu, B., & Huang, T. (2000). Statistical analysis of Chinese language and language modeling based on huge text corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1948, pp. 279–286). Springer Verlag. https://doi.org/10.1007/3-540-40063-x_37