BUCC2017: A hybrid approach for identifying parallel sentences in comparable corpora

Sainik Kumar Mahata; Dipankar Das; Sivaji Bandyopadhyay

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2017) 56-59

DOI: 10.18653/v1/w17-2511

7 Citations

63 Readers

Abstract

A Statistical Machine Translation (SMT) system must be trained on a large parallel corpus to produce effective translations. Such corpora are not only scarce; preparing them also involves considerable manual labor and cost. A parallel corpus can instead be prepared from comparable corpora, that is, a pair of corpora in two different languages that cover the same domain. In the present work, we try to build a parallel corpus for the French-English language pair from a given comparable corpus. The data and the problem set are provided as part of the shared task organized by BUCC 2017. We propose a system that first translates the sentences, relying heavily on Moses, and then groups the sentences based on sentence length similarity. Finally, one-to-one sentence selection is done using cosine similarity.
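The last two steps named in the abstract (grouping by sentence length, then one-to-one selection by cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the raw term-count vectors, and the thresholds `max_len_diff` and `min_sim` are all assumptions made for illustration. The translation step (done with Moses in the paper) is not shown; the sketch assumes the source sentences have already been translated into the target language.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def match_sentences(translated, targets, max_len_diff=2, min_sim=0.5):
    """Pair each translated source sentence with at most one target sentence.

    translated: source sentences already translated into the target language
                (the translation step itself is outside this sketch).
    max_len_diff, min_sim: hypothetical thresholds, not values from the paper.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    target_tokens = [t.lower().split() for t in targets]
    pairs = []
    for i, sentence in enumerate(translated):
        tokens = sentence.lower().split()
        # Step 1: keep only targets whose token count is close to this sentence's.
        candidates = [
            j for j, tt in enumerate(target_tokens)
            if abs(len(tt) - len(tokens)) <= max_len_diff
        ]
        # Step 2: among those, select the single most similar target.
        scored = [(cosine_similarity(tokens, target_tokens[j]), j) for j in candidates]
        if scored:
            sim, j = max(scored)
            if sim >= min_sim:
                pairs.append((i, j, sim))
    return pairs


if __name__ == "__main__":
    translated = ["the cat sits on the mat", "i like green apples"]
    targets = [
        "we went to the market yesterday",
        "the cat sat on the mat",
        "i like red apples very much indeed today",
    ]
    for src, tgt, sim in match_sentences(translated, targets):
        print(src, tgt, round(sim, 3))
```

The length filter shrinks the candidate set before any similarity is computed, so the more expensive comparison runs only on plausible pairs. The similarity threshold lets a sentence remain unmatched when no candidate is close enough.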

Cite (APA)

Mahata, S. K., Das, D., & Bandyopadhyay, S. (2017). BUCC2017: A hybrid approach for identifying parallel sentences in comparable corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 56–59). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-2511
