Abstract
The hidden Markov model (HMM) is frequently used for Pinyin-to-Chinese conversion, but it captures only the dependency on the preceding character. Higher-order Markov models can yield higher accuracy, but they are computationally unaffordable on an average PC. We propose a segment-based hidden Markov model (SHMM), which has the same order of complexity as a first-order HMM but achieves higher decoding accuracy. SHMM distinguishes a word from a bigram that connects two words, and assigns a reasonable probability to each word as a whole. It is more powerful than HMM at decoding words of more than two characters. We conduct a comprehensive Pinyin-to-Chinese conversion evaluation on the Lancaster corpus. The experiment shows that perfect-sentence accuracy improves from 34.7% (HMM) to 43.3% (SHMM), and one-error sentence accuracy increases from 72.7% to 78.3%. Furthermore, SHMM integrates seamlessly with pinyin typing correction, acronym pinyin input, user-defined words, and self-adaptive learning, all of which a commercial Pinyin-to-Chinese conversion product needs in order to improve the efficiency of pinyin input. Copyright 2007 ACM.
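To make the first-order baseline concrete, the following is a minimal sketch of Viterbi decoding for a character-level HMM, in which each character depends only on the one before it. It is an illustration, not the authors' implementation and not SHMM. The candidate lists, the probabilities, and the names `CANDIDATES`, `UNIGRAM`, `BIGRAM`, and `viterbi` are all invented for this example, and the emission probability of a syllable given a character is assumed to be 1.

```python
import math

# Toy, made-up parameters (NOT taken from the paper).
# Candidate characters for each pinyin syllable.
CANDIDATES = {
    "zhong": ["中", "种"],
    "guo": ["国", "过"],
}
# P(character) for the first position in a sentence.
UNIGRAM = {"中": 0.6, "种": 0.4, "国": 0.5, "过": 0.5}
# P(current character | preceding character).
BIGRAM = {
    ("中", "国"): 0.8,
    ("中", "过"): 0.2,
    ("种", "国"): 0.1,
    ("种", "过"): 0.9,
}
UNSEEN = 1e-6  # fallback probability for bigrams missing from the table


def viterbi(syllables):
    """Return the most probable character string for a list of syllables."""
    # best[c] = (log-probability, path) of the best path ending in character c
    best = {c: (math.log(UNIGRAM[c]), [c]) for c in CANDIDATES[syllables[0]]}
    for syllable in syllables[1:]:
        new_best = {}
        for cur in CANDIDATES[syllable]:
            # Extend every surviving path by `cur` and keep the best one.
            scored = [
                (logp + math.log(BIGRAM.get((prev, cur), UNSEEN)), path + [cur])
                for prev, (logp, path) in best.items()
            ]
            new_best[cur] = max(scored, key=lambda item: item[0])
        best = new_best
    return "".join(max(best.values(), key=lambda item: item[0])[1])


print(viterbi(["zhong", "guo"]))
```

Because each step looks back only one character, the cost grows with the square of the number of candidates per syllable. A higher-order model would have to track longer histories, which is the expense the abstract refers to.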
Cite
Zhou, X., Hu, X., Zhang, X., & Shen, X. (2007). A segment-based hidden markov model for real-setting pinyin-to-Chinese conversion. In International Conference on Information and Knowledge Management, Proceedings (pp. 1027–1030). https://doi.org/10.1145/1321440.1321602