Extracting parallel phrases from comparable data

Abstract

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these parallel phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path of word alignments; a phrase extraction approach that does not rely on the Viterbi path, but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the effectiveness of these approaches in detecting alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.
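
To make the first approach and the evaluation metric concrete, the following is a minimal sketch of the textbook consistency heuristic for extracting phrase pairs from a word alignment, together with an F-measure computed over extracted spans. It is not the authors' implementation: the function names, the `max_len` parameter, the span representation, and the toy alignment are all assumptions made for illustration, and the sketch omits the usual extension over unaligned words.

```python
def extract_phrase_pairs(alignment, src_len, max_len=4):
    """Return phrase pairs (s1, s2, t1, t2), inclusive word indices,
    that are consistent with a word alignment.

    alignment: set of (source_index, target_index) links, e.g. taken
               from a Viterbi word alignment (assumed input format).
    A pair is consistent when no word inside either span is linked
    to a word outside the other span.
    """
    pairs = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # Target positions linked to any source word in [s1, s2].
            tgt = [t for (s, t) in alignment if s1 <= s <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            if t2 - t1 + 1 > max_len:
                continue
            # Every link touching the target span must stay inside the source span.
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                pairs.append((s1, s2, t1, t2))
    return pairs


def f_measure(predicted, gold):
    """Balanced F-measure (harmonic mean of precision and recall) over span sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical data): three source words, with the last two reordered.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
toy_pairs = extract_phrase_pairs(toy_alignment, src_len=3)
```

In comparable sentences many words have no true counterpart, so a Viterbi alignment forced over the whole sentence pair tends to contain spurious links that break this consistency check; that is one plausible reading of why the abstract also considers an approach that does not depend on the Viterbi path.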

Hewavitharana, S., & Vogel, S. (2013). Extracting parallel phrases from comparable data. In Building and Using Comparable Corpora (pp. 191–204). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_10
