Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM

118Citations
Citations of this article
115Readers
Mendeley users who have this article in their library.

Abstract

We present a method capable of extracting parallel sentences from far more disparate “very-non-parallel corpora” than previous “comparable corpora” methods, by exploiting bootstrapping on top of IBM Model 4 EM. Step 1 of our method, like previous methods, uses similarity measures to find matching documents in a corpus first, and then extracts parallel sentences as well as new word translations from these documents. But unlike previous methods, we extend this with an iterative bootstrapping framework based on the principle of “find-one-get-more”, which claims that documents found to contain one pair of parallel sentences must contain others even if the documents are judged to be of low similarity. We re-match documents based on extracted sentence pairs, and refine the mining process iteratively until convergence. This novel “find-one-get-more” principle allows us to add more parallel sentences from dissimilar documents, to the baseline set. Experimental results show that our proposed method is nearly 50% more effective than the baseline method without iteration. We also show that our method is effective in boosting the performance of the IBM Model 4 EM lexical learner as the latter, though stronger than Model 1 used in previous work, does not perform well on data from very-non-parallel corpus.

Cite

CITATION STYLE

APA

Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004 - A meeting of SIGDAT, a Special Interest Group of the ACL held in conjunction with ACL 2004 (pp. 57–63). Association for Computational Linguistics (ACL).

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free