This paper presents the participation of the University of Hamburg in the Biomedical Translation Task of the Second Conference on Machine Translation (WMT 2017). Our contribution is a new approach to data selection for machine translation based on Paragraph Vector and a feed-forward neural network classifier, in which continuous distributed vector representations of sentences serve as features for the binary classifier. Most data selection approaches score and rank general-domain sentences by their similarity to the in-domain data and then set a range of thresholds to select a percentage of them for training various MT systems. The novelty of our method is an automatic threshold detection paradigm for data selection, which offers a simple and efficient way of selecting the sentences most similar to the in-domain data. The approach yields encouraging results for seven language pairs and four data sets.
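The following is a minimal sketch of one plausible reading of classifier-based selection, not the authors' implementation. It assumes that sentences are already embedded as fixed-length vectors, that the classifier is trained to separate in-domain from general-domain sentences, and that a general-domain sentence is kept when the classifier labels it in-domain, so the decision boundary replaces a hand-tuned similarity threshold. To stay dependency-free, a single logistic unit stands in for the feed-forward network and hand-written 2-D vectors stand in for Paragraph Vector embeddings; all names (`train_logistic`, `select_in_domain`, the toy sentences) are hypothetical.

```python
import math


def predict_proba(weights, bias, x):
    """Probability that vector x belongs to the in-domain class."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors, labels, epochs=200, lr=0.5):
    """Fit one logistic unit with plain stochastic gradient descent.

    Assumption: this stands in for the feed-forward classifier named
    in the abstract; it is not the architecture used in the paper.
    """
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def select_in_domain(pool, weights, bias):
    """Keep pool sentences that the classifier labels as in-domain.

    The cut-off is the classifier's own decision boundary (p >= 0.5),
    so no range of similarity thresholds has to be tried by hand.
    """
    return [s for s, vec in pool if predict_proba(weights, bias, vec) >= 0.5]


# Toy stand-ins for sentence embeddings (assumed, not real Paragraph Vectors).
in_domain_vecs = [(1.0, 0.8), (0.9, 1.1), (1.2, 1.0)]
general_vecs = [(-1.0, -0.9), (-0.8, -1.2), (-1.1, -1.0)]

train_x = in_domain_vecs + general_vecs
train_y = [1] * len(in_domain_vecs) + [0] * len(general_vecs)
weights, bias = train_logistic(train_x, train_y)

# General-domain pool from which training data is selected.
pool = [
    ("the patient received 5 mg of the drug", (1.0, 0.9)),
    ("stock markets fell sharply on monday", (-0.9, -1.0)),
    ("adverse events were reported in two cases", (1.1, 1.2)),
    ("the match ended in a draw", (-1.2, -0.8)),
]

selected = select_in_domain(pool, weights, bias)
print(selected)
```

In a realistic setting the toy vectors would be replaced by learned sentence embeddings and the logistic unit by a network with hidden layers; the selection step itself would stay the same.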
Duma, M. S., & Menzel, W. (2017). Automatic threshold detection for data selection in machine translation. In WMT 2017 - 2nd Conference on Machine Translation, Proceedings (pp. 483–488). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4754