Adapting lexical and language models for transcription of highly spontaneous spoken Czech

Abstract

This paper addresses the automatic transcription of spontaneous conversations in Czech. Such speech is informal and contains many colloquial words. Building an appropriate lexicon and language model is difficult because the linguistic resources that represent colloquial Czech are limited to several small corpora collected by the Institute of the Czech National Corpus. To overcome this, we introduce transformations between the most frequent colloquial words and their counterparts in formal Czech. This allows us (a) to combine the small spoken corpora with much larger corpora of more formal texts, (b) to optimize the recognizer's lexicon, and (c) to address the data sparsity problem when estimating a probabilistic language model. We have applied this approach in the design of a system for transcribing spontaneous telephone conversations. Its most recent version operates with an accuracy of about 48%, and the proposed transformations together with corpora mixing contributed a 9% improvement over the baseline system. © 2010 Springer-Verlag Berlin Heidelberg.
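
The idea of mapping colloquial words to formal counterparts before pooling corpora can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word pairs, the function names, and the `spoken_weight` parameter are all assumptions made for illustration, and a unigram model stands in for whatever language model the actual system uses.

```python
from collections import Counter

# Hypothetical colloquial -> formal pairs, chosen only for illustration;
# the abstract does not list the transformations the authors used.
COLLOQUIAL_TO_FORMAL = {
    "bejt": "být",
    "dobrej": "dobrý",
    "mlíko": "mléko",
}


def normalize(tokens, mapping=COLLOQUIAL_TO_FORMAL):
    """Replace each colloquial token with its formal counterpart, if known."""
    return [mapping.get(tok, tok) for tok in tokens]


def mixed_unigram_counts(spoken, written, spoken_weight=1.0,
                         mapping=COLLOQUIAL_TO_FORMAL):
    """Pool word counts from a small spoken corpus and a large written one.

    Spoken sentences are normalized first so that colloquial and formal
    variants share one count. `spoken_weight` is an assumed knob for
    emphasizing the in-domain spoken data.
    """
    counts = Counter()
    for sentence in spoken:
        for tok in normalize(sentence, mapping):
            counts[tok] += spoken_weight
    for sentence in written:
        for tok in sentence:
            counts[tok] += 1.0
    return counts


def unigram_probs(counts):
    """Turn pooled counts into maximum-likelihood unigram probabilities."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


spoken_corpus = [["dobrej", "den"]]
written_corpus = [["dobrý", "den"], ["dobrý", "večer"]]
pooled = mixed_unigram_counts(spoken_corpus, written_corpus)
probs = unigram_probs(pooled)
```

In this toy setting the colloquial form "dobrej" never gets a count of its own; its occurrences are merged into "dobrý", which is how such a mapping would reduce sparsity when the spoken corpus is small.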

Cite

Nouza, J., & Silovský, J. (2010). Adapting lexical and language models for transcription of highly spontaneous spoken Czech. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6231 LNAI, pp. 377–384). https://doi.org/10.1007/978-3-642-15760-8_48
