We address the problem of automatically cleaning a large-scale Translation Memory (TM) in a fully unsupervised fashion, i.e., without human-labelled data. We approach the task by: i) designing a set of features that capture the similarity between two text segments in different languages, ii) using them to induce reliable training labels for a subset of the translation units (TUs) contained in the TM, and iii) using the automatically labelled data to train an ensemble of binary classifiers. We apply our method to clean a test set composed of 1,000 TUs randomly extracted from the English-Italian version of MyMemory, the world's largest public TM. Our results show competitive performance not only against a strong baseline that exploits machine translation, but also against a state-of-the-art method that relies on human-labelled data.
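The three-step pipeline described above can be sketched in miniature. This is a hypothetical illustration, not the paper's actual implementation: the two features, the confidence thresholds, and the per-feature threshold classifiers are all simplifying assumptions standing in for the paper's richer feature set and classifier ensemble.

```python
def extract_features(src, tgt):
    """Toy cross-lingual similarity features for a translation unit (TU).
    Illustrative only: length ratio and matching of digit-bearing tokens."""
    len_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    src_nums = {t for t in src.split() if any(c.isdigit() for c in t)}
    tgt_nums = {t for t in tgt.split() if any(c.isdigit() for c in t)}
    num_match = 1.0 if src_nums == tgt_nums else 0.0
    return [len_ratio, num_match]

def induce_labels(scored_tus, hi=0.8, lo=0.3):
    """Step ii): keep only TUs whose mean feature score is confidently high
    or low; these pseudo-labelled TUs become the training set."""
    labelled = []
    for feats in scored_tus:
        score = sum(feats) / len(feats)
        if score >= hi:
            labelled.append((feats, 1))   # pseudo-label: good translation
        elif score <= lo:
            labelled.append((feats, 0))   # pseudo-label: bad translation
    return labelled                        # uncertain TUs are discarded

class ThresholdClassifier:
    """One weak binary classifier: thresholds a single feature at the
    midpoint between the class means observed in training."""
    def __init__(self, idx):
        self.idx, self.thr = idx, 0.5
    def fit(self, data):
        pos = [f[self.idx] for f, y in data if y == 1]
        neg = [f[self.idx] for f, y in data if y == 0]
        if pos and neg:
            self.thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self
    def predict(self, feats):
        return 1 if feats[self.idx] >= self.thr else 0

def ensemble_predict(classifiers, feats):
    """Step iii): majority vote over the trained ensemble."""
    votes = sum(c.predict(feats) for c in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

# Usage on two toy English-Italian TUs (one plausible, one clearly wrong):
tus = [("the 3 cats", "i 3 gatti"), ("section 12", "ciao")]
scored = [extract_features(s, t) for s, t in tus]
train = induce_labels(scored)
ensemble = [ThresholdClassifier(i).fit(train) for i in range(2)]
```

The point of the sketch is the structure of the pipeline: unsupervised scoring yields confident pseudo-labels at the extremes, and only those are used to train the supervised ensemble that then classifies every TU.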
Sabet, M. J., Negri, M., Turchi, M., & Barbu, E. (2016). An unsupervised method for automatic translation memory cleaning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Short Papers (pp. 287–292). Association for Computational Linguistics. https://doi.org/10.18653/v1/p16-2047