Building a multilingual named entity-annotated corpus using annotation projection

Maud Ehrmann; Marco Turchi; Ralf Steinberger

Conference Proceedings

International Conference Recent Advances in Natural Language Processing, RANLP (2011) 118-124

ISSN: 1313-8502

50 Citations

93 Readers

Abstract

As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English version of a multi-parallel corpus and by projecting the annotations into all the other language versions. For the translation of English entities, we use a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database. For the projection, we incrementally apply different methods: perfect string matching, perfect consonant signature matching and edit distance similarity. The resulting annotated parallel corpus will be made available for reuse.
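
The incremental projection step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the consonant signature is computed, which edit distance is used, or what similarity threshold applies. The function names (`project`, `consonant_signature`, `similarity`), the vowel set, the Latin-script-only normalization, the span length limit and the 0.8 threshold are all assumptions made for illustration. The translated entity string is taken as given (in the paper it comes from the machine translation system or the name database).

```python
import unicodedata

VOWELS = set("aeiouy")  # assumption: "y" counts as a vowel (Latin script only)


def normalize(text):
    """Lowercase and strip diacritics (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def consonant_signature(text):
    """Keep only the consonant letters of the normalized text (assumed definition)."""
    return "".join(ch for ch in normalize(text) if ch.isalpha() and ch not in VOWELS)


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def project(translation, target_tokens, threshold=0.8):
    """Locate a translated entity in a tokenized target sentence.

    Tries exact match, then consonant signature match, then edit distance
    similarity. Returns (start, end, method) with `end` exclusive, or None.
    """
    # Assumption: candidate spans are at most one token longer than the translation.
    max_len = len(translation.split()) + 1
    spans = [
        (start, end)
        for start in range(len(target_tokens))
        for end in range(start + 1, min(start + max_len, len(target_tokens)) + 1)
    ]
    texts = {span: " ".join(target_tokens[span[0]:span[1]]) for span in spans}

    # Step 1: perfect string matching.
    for span in spans:
        if texts[span] == translation:
            return (*span, "exact")

    # Step 2: perfect consonant signature matching.
    signature = consonant_signature(translation)
    if signature:
        for span in spans:
            if consonant_signature(texts[span]) == signature:
                return (*span, "consonant_signature")

    # Step 3: best edit distance similarity at or above the threshold.
    best_span, best_score = None, 0.0
    for span in spans:
        score = similarity(normalize(translation), normalize(texts[span]))
        if score > best_score:
            best_span, best_score = span, score
    if best_span is not None and best_score >= threshold:
        return (*best_span, "edit_distance")
    return None
```

For the token list `"Le président Nicolas Sarkozy a parlé".split()`, `project("Nicolas Sarkozy", tokens)` returns `(2, 4, "exact")`, `project("Nicolas Sarkozi", tokens)` falls through to `(2, 4, "consonant_signature")`, and `project("Nikolas Sarkozy", tokens)` is only recovered by the last step, `(2, 4, "edit_distance")`. Ordering the methods from strictest to loosest means a looser match is only accepted when every stricter one has failed.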

Cite (APA)

Ehrmann, M., Turchi, M., & Steinberger, R. (2011). Building a multilingual named entity-annotated corpus using annotation projection. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 118–124). Incoma Ltd.
