Abstract
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
Cite
Anastasopoulos, A., Bansal, S., Goldwater, S., Lopez, A., & Chiang, D. (2017). Spoken term discovery for language documentation using translations. In EMNLP 2017 - 1st Workshop on Speech-Centric Natural Language Processing, SCNLP 2017 - Proceedings of the Workshop (pp. 53–58). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4607