Comparison of s-gram proximity measures in out-of-vocabulary word translation

Anni Järvelin; Antti Järvelin

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2008) 5280 LNCS 75-86

DOI: 10.1007/978-3-540-89097-3_9


Abstract

Classified s-grams have been used successfully in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. In particular, s-grams have consistently outperformed other approximate string matching techniques, such as edit distance and n-grams. The Jaccard coefficient has traditionally been used as the s-gram-based string proximity measure, and other proximity measures for s-gram matching have not been tested. The current study evaluated the performance of seven proximity measures for classified s-grams in a CLIR context using eleven language pairs. The binary proximity measures generally performed better than their non-binary counterparts, but the difference depended mainly on the padding used with the s-grams. When no padding was used, the binary and non-binary proximity measures performed nearly equally, though overall performance deteriorated. © 2009 Springer Berlin Heidelberg.
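The abstract does not define its terms, so the following is only a minimal illustrative sketch of how a Jaccard coefficient over classified s-grams might be computed, not the authors' implementation. It assumes that an s-gram is a character pair with a fixed number of skipped characters between its two members, that s-grams are grouped into classes by skip length (the default grouping `((0,), (1, 2))` is an assumption), and that padding means adding a pad string to both ends of the word. The function names, the `pad` parameter, and the min/max multiset form used for the non-binary variant are all hypothetical choices made for illustration; the seven measures evaluated in the paper may be defined differently.

```python
from collections import Counter


def s_grams(word, skip, pad=""):
    """Return character pairs with `skip` characters skipped between them.

    `pad` is an assumed padding string added to both ends of the word.
    """
    padded = pad + word + pad
    step = skip + 1
    return [padded[i] + padded[i + step] for i in range(len(padded) - step)]


def classified_s_grams(word, classes=((0,), (1, 2)), pad=""):
    """Build one multiset of s-grams per skip class (grouping is assumed)."""
    result = []
    for skips in classes:
        grams = Counter()
        for skip in skips:
            grams.update(s_grams(word, skip, pad))
        result.append(grams)
    return result


def jaccard(a, b, classes=((0,), (1, 2)), pad="", binary=True):
    """Jaccard coefficient summed over s-gram classes.

    binary=True compares sets of distinct s-grams; binary=False uses
    multiset counts (min for intersection, max for union), which is one
    possible non-binary variant and an assumption of this sketch.
    """
    inter = union = 0
    for ga, gb in zip(classified_s_grams(a, classes, pad),
                      classified_s_grams(b, classes, pad)):
        if binary:
            inter += len(set(ga) & set(gb))
            union += len(set(ga) | set(gb))
        else:
            inter += sum((ga & gb).values())
            union += sum((ga | gb).values())
    return inter / union if union else 0.0
```

In an OOV translation setting, a source-language word would be scored against each candidate word in a target-language index and the highest-scoring candidates kept. Comparing the scores with `pad=""` and with a non-empty `pad` gives a rough feel for why padding can affect binary and non-binary measures differently.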

Citation (APA)

Järvelin, A., & Järvelin, A. (2008). Comparison of s-gram proximity measures in out-of-vocabulary word translation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5280 LNCS, pp. 75–86). Springer Verlag. https://doi.org/10.1007/978-3-540-89097-3_9