Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. In particular, s-grams have consistently outperformed other approximate string matching techniques, such as edit distance and n-grams. The Jaccard coefficient has traditionally been used as the s-gram based string proximity measure, whereas other proximity measures for s-gram matching have not been tested. In the current study, the performance of seven proximity measures for classified s-grams in a CLIR context was evaluated using eleven language pairs. The binary proximity measures generally performed better than their non-binary counterparts, but the difference depended mainly on the padding used with s-grams. When no padding was used, the binary and non-binary proximity measures performed nearly equally, though overall performance deteriorated. © 2009 Springer Berlin Heidelberg.
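As an illustration of the kind of matching the abstract describes, the following minimal Python sketch builds character pairs with a fixed number of skipped characters and compares two words with a binary (set-based) and a non-binary (count-based) Jaccard coefficient. It rests on several assumptions: the function names, the `_` padding character, and the padding width are hypothetical, a single skip distance is used rather than the paper's classification of s-grams, and the count-based variant is only one possible reading of "non-binary". It is not the authors' implementation.

```python
from collections import Counter


def s_grams(word, skip, pad=True):
    """Return character pairs whose members have `skip` characters between them.

    Assumption: padding adds (skip + 1) underscore characters to each end,
    so every pair contains at least one real character of the word.
    """
    step = skip + 1
    if pad:
        word = "_" * step + word + "_" * step
    return [word[i] + word[i + step] for i in range(len(word) - step)]


def jaccard_binary(a, b, skip=0, pad=True):
    """Binary Jaccard: each distinct s-gram counts once."""
    ga, gb = set(s_grams(a, skip, pad)), set(s_grams(b, skip, pad))
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def jaccard_counts(a, b, skip=0, pad=True):
    """Count-based Jaccard: repeated s-grams contribute their frequency."""
    ca, cb = Counter(s_grams(a, skip, pad)), Counter(s_grams(b, skip, pad))
    union = sum((ca | cb).values())  # per-gram maximum counts
    shared = sum((ca & cb).values())  # per-gram minimum counts
    return shared / union if union else 0.0
```

The two variants differ only when an s-gram repeats within a word. For instance, comparing "aaa" with "aa" without padding, the set-based score treats the words as identical, while the count-based score does not.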
Järvelin, A., & Järvelin, A. (2008). Comparison of s-gram proximity measures in out-of-vocabulary word translation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5280 LNCS, pp. 75–86). Springer Verlag. https://doi.org/10.1007/978-3-540-89097-3_9