Dealing with bilingualism in automatic transcription of historical archive of Czech radio

Jan Nouza; Petr Cerva; Jan Silovsky

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 8158 LNCS 238-246

DOI: 10.1007/978-3-642-41190-8_26

Abstract

One of the biggest challenges in the automatic transcription of the historical audio archive of Czech and Czechoslovak radio is bilingualism. Two closely related languages, Czech and Slovak, are mixed in many archive documents. Both were official languages of the former Czechoslovakia (1918-1992) and both were used in the media. The two languages are considered similar, although they differ in more than 75% of their lexical inventories, which complicates automatic speech-to-text conversion. In this paper, we present and objectively measure the difference between the two languages. We then propose a method suitable for the automatic identification of two acoustically and lexically similar languages. It is based on employing two size-optimized parallel lexicons and language models. On large test data, we show that the two languages can be distinguished with almost 99% accuracy. Moreover, the language identification module can be easily incorporated into a two-pass decoding scheme with almost negligible additional computation costs. The proposed method has been employed in a project aimed at the disclosure of Czech and Czechoslovak oral cultural heritage. © 2013 Springer-Verlag.

Cite

Nouza, J., Cerva, P., & Silovsky, J. (2013). Dealing with bilingualism in automatic transcription of historical archive of Czech radio. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8158 LNCS, pp. 238–246). https://doi.org/10.1007/978-3-642-41190-8_26
