Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an $F_1$ score of 85.79, gaining 1.89 points compared to the best performing individual model. The corpus and the models are publicly available.
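The voting step can be illustrated with a minimal sketch of per-token majority voting over the outputs of three taggers. The abstract does not say how ties are resolved or which tagging scheme is used, so the `fallback` tie-breaking rule, the function name `vote_labels`, and the label strings below are assumptions made for illustration only, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def vote_labels(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token labels from several taggers by majority vote.

    predictions[i][j] is the label that model i assigns to token j.
    When no single label has the most votes, the label from the model at
    index `fallback` is used (an assumed rule; the source does not
    describe tie-breaking).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    combined = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels).most_common()
        top_label, top_count = counts[0]
        # Collect every label that shares the highest vote count.
        tied = [label for label, count in counts if count == top_count]
        if len(tied) > 1:
            combined.append(token_labels[fallback])
        else:
            combined.append(top_label)
    return combined


# Hypothetical outputs of the three models for a four-token sentence.
crf = ["B-Person", "I-Person", "O", "B-Location"]
perceptron = ["B-Person", "O", "O", "B-Location"]
bilstm = ["B-Person", "I-Person", "O", "B-Organization"]

voted = vote_labels([crf, perceptron, bilstm])
```

With three models, a tie can only occur when all three disagree, so the choice of `fallback` amounts to deciding which single model to trust in that case.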
Ingólfsdóttir, S. L., Guðjónsson, Á. A., & Loftsson, H. (2020). Named entity recognition for Icelandic: Annotated corpus and models. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12379 LNAI, pp. 46–57). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-59430-5_4