This paper describes the LINA system for the BUCC 2015 shared track. Following Enright and Kondrak (2007), our system identifies comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs that share target documents, using pigeonhole reasoning and cross-lingual information.
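As a rough illustration of the idea, the following minimal sketch pairs each source document with the target document that shares the most hapax words (words occurring exactly once in a document), then keeps only the highest-scoring pair for each target. It is not the LINA implementation: the function names, the whitespace tokenization, the `min_length` threshold, and the reading of "pigeonhole reasoning" as "at most one source document per target document" are all assumptions made for this example, and the cross-lingual information step is not modelled.

```python
from collections import Counter


def hapax_words(text, min_length=4):
    """Return the words that occur exactly once in `text`.

    Tokenization by whitespace and the `min_length` cutoff are
    illustrative assumptions, not details taken from the paper.
    """
    counts = Counter(text.lower().split())
    return {w for w, c in counts.items() if c == 1 and len(w) >= min_length}


def best_targets(source_docs, target_docs):
    """Map each source id to (target id, score), where score is the
    number of hapax words shared with the best-matching target."""
    target_hapax = {tid: hapax_words(t) for tid, t in target_docs.items()}
    pairs = {}
    for sid, text in source_docs.items():
        src = hapax_words(text)
        scores = {tid: len(src & hap) for tid, hap in target_hapax.items()}
        if not scores:
            continue
        tid = max(scores, key=scores.get)
        if scores[tid] > 0:  # ignore sources with no shared hapax words
            pairs[sid] = (tid, scores[tid])
    return pairs


def pigeonhole_filter(pairs):
    """Assumed pigeonhole step: when several source documents point to
    the same target, keep only the pair with the highest score."""
    best = {}
    for sid, (tid, score) in pairs.items():
        if tid not in best or score > best[tid][1]:
            best[tid] = (sid, score)
    return {sid: tid for tid, (sid, _score) in best.items()}
```

Calling `pigeonhole_filter(best_targets(source_docs, target_docs))` on two dictionaries that map document ids to text returns one source id per retained target id. Matching on identical surface forms tends to pick up names and numbers, which are often written the same way across languages.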
Morin, E., Hazem, A., Loginova-Clouet, E., & Boudin, F. (2015). LINA: Identifying Comparable Documents from Wikipedia. In 8th Workshop on Building and Using Comparable Corpora, BUCC 2015 - co-located with 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2015 - Proceedings (pp. 88–91). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w15-3413