Parallel texts extraction from multimodal comparable corpora

Haithem Afli; Loïc Barrault; Holger Schwenk

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7614 LNAI 41-51

DOI: 10.1007/978-3-642-33983-7_5

2 Citations

8 Readers

Abstract

Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of texts on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval in a large text corpus in the target language to extract parallel sentences. We have performed a series of experiments on data from the IWSLT'11 speech translation task that show the feasibility of our approach. © 2012 Springer-Verlag Berlin Heidelberg.
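The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the ASR transcripts and their SMT translations are already available as plain strings, and it swaps in a simple TF-IDF cosine similarity as the retrieval model; the abstract does not say which retrieval engine, scoring function, or filtering criterion the authors used. The function names (`build_index`, `extract_pairs`) and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems use proper tokenization).
    return text.lower().split()


def build_index(target_corpus):
    """Build TF-IDF vectors for the target-language sentences."""
    docs = [Counter(tokenize(s)) for s in target_corpus]
    df = Counter()
    for d in docs:
        df.update(d.keys())
    n = len(docs)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: f * idf[t] for t, f in d.items()} for d in docs]
    return vecs, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def extract_pairs(transcripts, translations, target_corpus, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    transcripts:  ASR output in the source language.
    translations: SMT output for each transcript (used as the query).
    threshold:    hypothetical similarity cutoff, not taken from the paper.
    """
    vecs, idf = build_index(target_corpus)
    pairs = []
    for src, hyp in zip(transcripts, translations):
        # Words unseen in the target corpus get weight 0 and cannot match.
        query = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(hyp)).items()}
        scores = [cosine(query, v) for v in vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((src, target_corpus[best], scores[best]))
    return pairs
```

Using the translation, rather than the transcript, as the query keeps the comparison within a single language; the retained pair still joins the original source-side transcript with the retrieved target-side sentence.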

Cite (APA)

Afli, H., Barrault, L., & Schwenk, H. (2012). Parallel texts extraction from multimodal comparable corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7614 LNAI, pp. 41–51). https://doi.org/10.1007/978-3-642-33983-7_5
