YODA System for WMT16 Shared Task: Bilingual Document Alignment


Abstract

In this paper, we address the task of automatically detecting and aligning bilingual documents that are translations of each other within a single web domain, as part of WMT 2016. Given the large amount of data available in each web domain, a brute-force approach such as computing similarities between every possible pair is computationally expensive. Therefore, we start with a simple approach that matches just the web page URLs after some pre-processing, which reduces the number of possible pairings to a small extent. This simple approach obtained a recall of 50%, and the exact matches it finds are removed from further consideration. We built on top of this with an n-gram based approach that uses the partial English translations of French web pages and achieved a recall of 93.71% on the training pairs provided. We also outline an IR-based approach that uses both the content and the metadata of each web page URL, obtaining a recall of 56.31%. Our final submission to this shared task, using the n-gram based approach, achieved a recall of 93.92%.
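The two-stage idea in the abstract (pair pages whose URLs match after pre-processing, then score the remaining pages by n-gram overlap against English translations of the French pages) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the language markers stripped from URLs, the helper names `normalize_url`, `match_by_url`, `ngrams`, and `best_match`, the use of word trigrams, and Jaccard overlap as the similarity score.

```python
import re
from collections import defaultdict

# Assumed language markers; the paper's actual pre-processing rules are not stated here.
LANG_MARKERS = re.compile(r"(?<![a-z])(en|eng|english|fr|fra|fre|french)(?![a-z])")


def normalize_url(url: str) -> str:
    """Reduce a URL to a language-independent key (illustrative rules only)."""
    url = url.lower()
    url = re.sub(r"^https?://", "", url)      # drop the scheme
    url = LANG_MARKERS.sub("", url)           # drop stand-alone language codes
    url = re.sub(r"[/_\-=?&.]+", "/", url)    # collapse runs of separators
    return url.strip("/")


def match_by_url(en_urls, fr_urls):
    """Pair English and French URLs whose normalized keys match unambiguously."""
    en_by_key, fr_by_key = defaultdict(list), defaultdict(list)
    for u in en_urls:
        en_by_key[normalize_url(u)].append(u)
    for u in fr_urls:
        fr_by_key[normalize_url(u)].append(u)
    pairs = []
    for key, en_list in en_by_key.items():
        fr_list = fr_by_key.get(key, [])
        # Skip keys shared by several pages on either side.
        if len(en_list) == 1 and len(fr_list) == 1:
            pairs.append((en_list[0], fr_list[0]))
    return pairs


def ngrams(text: str, n: int = 3):
    """Set of word n-grams from whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def best_match(fr_translated: str, en_docs: dict, n: int = 3):
    """Return the English page with the highest n-gram Jaccard overlap.

    fr_translated is the (partial) English translation of a French page;
    en_docs maps English URLs to page text. Jaccard is an assumed score.
    """
    target = ngrams(fr_translated, n)
    best_url, best_score = None, 0.0
    for url, text in en_docs.items():
        candidate = ngrams(text, n)
        union = target | candidate
        score = len(target & candidate) / len(union) if union else 0.0
        if score > best_score:
            best_url, best_score = url, score
    return best_url, best_score
```

In a pipeline shaped like the one the abstract describes, pages paired by `match_by_url` would be removed first, and `best_match` would then be run only on the French pages that remain, which keeps the number of pairwise comparisons down.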

Cite

Dara, A. A., & Lin, Y. C. (2016). YODA System for WMT16 Shared Task: Bilingual Document Alignment. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 2, pp. 679–684). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2366
