The word is mightier than the count: Accumulating translation resources from parsed parallel corpora

Stephen Nightingale; Hideki Tanaka

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2003) 2588 420-431

DOI: 10.1007/3-540-36456-0_45

Abstract

Large, high-quality, sentence-aligned parallel corpora are hard to come by, which makes the Statistical Machine Translation enterprise more difficult. Even noisy corpora, however, can provide useful translation resources that are not otherwise available. Many investigations have used statistical methods to find word correspondences. Such methods often suffer from overgeneration, so to correct this we filter relevant translation candidates using a lexical post-process. This dictionary lookup is in fact so effective that it calls into question the value of the statistical methods. Using a dictionary lookup against all combinations of phrase pairs as a baseline, we compare three statistical methods and report the results. The three methods are (1) Mutual Information; (2) Expectation Maximization (EM) over word co-occurrence frequencies; and (3) EM over word alignments in every sentence. We also apply the dictionary lookup as a post-process to tackle overgeneration.
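
To make the overgeneration problem and the dictionary post-process concrete, the following is a minimal sketch and not the authors' implementation. It assumes word-level tokens rather than parsed phrases, uses pointwise mutual information over sentence-level co-occurrence as a stand-in for the first method, and runs on an invented toy corpus and an invented bilingual dictionary. The names `pmi_candidates` and `dictionary_filter` are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_candidates(sentence_pairs, min_pmi=0.0):
    """Score (source, target) word pairs by pointwise mutual information.

    Each word and each pair is counted at most once per aligned sentence pair.
    Assumption: this is a generic PMI estimate, not the paper's exact formula.
    """
    n = len(sentence_pairs)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src), set(tgt)
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    scores = {}
    for (s, t), c in pair_count.items():
        # PMI = log( P(s, t) / (P(s) * P(t)) ) with probabilities as count / n
        pmi = math.log((c * n) / (src_count[s] * tgt_count[t]))
        if pmi > min_pmi:
            scores[(s, t)] = pmi
    return scores


def dictionary_filter(scores, dictionary):
    """Keep only candidate pairs that the bilingual dictionary confirms."""
    return {
        (s, t): v for (s, t), v in scores.items() if t in dictionary.get(s, ())
    }


# Invented toy data (assumption): three aligned sentence pairs.
pairs = [
    (["inu", "ga", "hashiru"], ["the", "dog", "runs"]),
    (["neko", "ga", "hashiru"], ["the", "cat", "runs"]),
    (["inu", "ga", "neru"], ["the", "dog", "sleeps"]),
]
# Invented dictionary (assumption): source word -> set of translations.
dictionary = {
    "inu": {"dog"},
    "neko": {"cat"},
    "hashiru": {"run", "runs"},
    "neru": {"sleeps"},
}

scores = pmi_candidates(pairs)
filtered = dictionary_filter(scores, dictionary)
```

On this toy corpus the statistical score alone keeps spurious pairs such as `("neko", "runs")`, because rare words co-occur by chance with everything in their sentence. The dictionary filter removes those and leaves only the pairs the dictionary confirms, which is the kind of post-process the abstract describes.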

Cite

Nightingale, S., & Tanaka, H. (2003). The word is mightier than the count: Accumulating translation resources from parsed parallel corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2588, pp. 420–431). Springer Verlag. https://doi.org/10.1007/3-540-36456-0_45
