Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of text on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval over a large target-language text corpus to extract parallel sentences. A series of experiments on data from the IWSLT'11 speech translation task shows the feasibility of our approach. © 2012 Springer-Verlag Berlin Heidelberg.
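The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the speech recognition and baseline SMT steps have already run, so the inputs are plain strings. The abstract does not say which retrieval engine or scoring function is used; the TF-IDF cosine similarity, the `threshold` parameter, and every function name below are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency over the target-language corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(source_transcripts, translations, target_corpus, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    source_transcripts: ASR output in the source language.
    translations: SMT output for each transcript, used as the retrieval query.
    target_corpus: candidate sentences in the target language.
    threshold: assumed cut-off; pairs scoring below it are discarded.
    """
    corpus_tokens = [tokenize(s) for s in target_corpus]
    idf = build_idf(corpus_tokens)
    corpus_vecs = [tfidf(tokens, idf) for tokens in corpus_tokens]

    pairs = []
    for src, hyp in zip(source_transcripts, translations):
        query = tfidf(tokenize(hyp), idf)
        scored = [(cosine(query, vec), i) for i, vec in enumerate(corpus_vecs)]
        if not scored:
            continue  # empty target corpus: nothing to retrieve
        best_score, best_idx = max(scored)
        if best_score >= threshold:
            pairs.append((src, target_corpus[best_idx], best_score))
    return pairs
```

Each returned tuple holds the source transcript, the retrieved target sentence, and the similarity score. Raising `threshold` trades recall for precision: fewer pairs are kept, but those that remain match the translation more closely.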
CITATION STYLE
Afli, H., Barrault, L., & Schwenk, H. (2012). Parallel texts extraction from multimodal comparable corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7614 LNAI, pp. 41–51). https://doi.org/10.1007/978-3-642-33983-7_5