YODA System for WMT16 Shared Task: Bilingual Document Alignment

Aswarth Abhilash Dara; Yiu Chang Lin

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2016) 2 679-684

DOI: 10.18653/v1/w16-2366

Abstract

In this paper, we address the task of automatically detecting and aligning bilingual documents that are translations of each other within a single web domain, as part of the WMT 2016 shared task. Given the large amount of data available in each web domain, a brute-force approach that computes the similarity between every possible pair of documents is computationally expensive. We therefore start with a simple approach that matches web page URLs after some pre-processing, which reduces the number of possible pairings to a small extent. This simple approach obtained a recall of 50%, and its exact matches are removed from further consideration. We then built on top of it an n-gram based approach that uses the partial English translations of French web pages, achieving a recall of 93.71% on the training pairs provided. We also outline an IR-based approach that uses both the content and the metadata of each web page URL, which obtained a recall of 56.31%. Our final submission to this shared task, using the n-gram based approach, achieved a recall of 93.92%.
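The two-stage idea in the abstract (cheap URL matching first, then n-gram comparison on what remains) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the language markers in `LANG_MARKER`, the separator-collapsing rule, the function names, and the use of Jaccard overlap over word trigrams as the similarity score.

```python
import re
from collections import defaultdict

# Hypothetical language markers; the paper's actual pre-processing rules
# are not stated in the abstract.
LANG_MARKER = re.compile(r"(?<![a-z])(en|fr|eng|fra|english|french|francais)(?![a-z])")


def normalize_url(url):
    """Reduce a URL to a language-independent key (illustrative rule set)."""
    url = re.sub(r"^https?://", "", url.lower())
    url = LANG_MARKER.sub("", url)
    # Collapse runs of separators, including those left by marker removal.
    url = re.sub(r"[/_\-=.?&]+", "/", url)
    return url.strip("/")


def match_by_url(en_urls, fr_urls):
    """Pair English and French URLs whose normalized keys are identical."""
    fr_by_key = defaultdict(list)
    for url in fr_urls:
        fr_by_key[normalize_url(url)].append(url)
    pairs = []
    for url in en_urls:
        candidates = fr_by_key.get(normalize_url(url), [])
        if len(candidates) == 1:  # keep only unambiguous matches
            pairs.append((url, candidates[0]))
    return pairs


def ngrams(text, n=3):
    """Set of word n-grams in a text, using whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(english_text, translated_french_text, n=3):
    """Jaccard overlap of word n-grams (an assumed scoring function)."""
    a, b = ngrams(english_text, n), ngrams(translated_french_text, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

In a pipeline of this shape, pairs returned by `match_by_url` would be removed from the candidate pool, and `ngram_overlap` would then be computed only between the remaining English pages and the English translations of the remaining French pages, which avoids scoring every possible pair.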

Citation (APA)

Dara, A. A., & Lin, Y. C. (2016). YODA System for WMT16 Shared Task: Bilingual Document Alignment. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 2, pp. 679–684). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2366
