Named entity recognition for Icelandic: Annotated corpus and models

Svanhvít L. Ingólfsdóttir; Ásmundur A. Guðjónsson; Hrafn Loftsson

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020) 12379 LNAI 46-57

DOI: 10.1007/978-3-030-59430-5_4

Abstract

Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs), using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an $F_1$ score of 85.79, a gain of 1.89 points over the best-performing individual model. The corpus and the models are publicly available.
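The simple voting step mentioned in the abstract could look roughly like the sketch below. It is an illustration only, not the authors' implementation: the function name `vote`, the label strings, and the tie-breaking rule (fall back to the earliest-listed model) are all assumptions, since the abstract does not describe how ties are resolved or which tag names are used.

```python
from collections import Counter
from typing import List, Sequence


def vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token label sequences from several taggers by majority vote.

    Assumption: on a tie, the label from the earliest-listed model wins.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(label for label in labels if counts[label] == best))
    return combined


# Hypothetical outputs of three taggers for the same four tokens.
crf = ["B-Person", "I-Person", "O", "B-Location"]
perceptron = ["B-Person", "O", "O", "B-Location"]
bilstm = ["B-Person", "I-Person", "O", "O"]

combined = vote([crf, perceptron, bilstm])
print(combined)
```

With three voters, a label needs two votes to win outright. The tie-break only matters when all three models disagree on a token.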

Ingólfsdóttir, S. L., Guðjónsson, Á. A., & Loftsson, H. (2020). Named entity recognition for Icelandic: Annotated corpus and models. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12379 LNAI, pp. 46–57). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-59430-5_4
