Optimizing CRF-based model for proper name recognition in Polish texts

Michał Marcińczuk; Maciej Janicki

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012), vol. 7181 LNCS (Part 1), pp. 258-269

DOI: 10.1007/978-3-642-28604-9_22

Abstract

In this paper we present several optimizations introduced to a Conditional Random Fields-based model for proper name recognition in Polish running text. The proposed optimizations address word-level segmentation problems, gazetteer incompleteness, the problem of unambiguous generalization features, feature construction and selection, and finally the recognition of common proper names on the basis of external sources of knowledge. The task is limited to recognizing person first names and surnames and names of countries, cities and roads. The evaluation is performed in two ways: a single-domain evaluation using 10-fold cross-validation on a Corpus of Stock Exchange Reports, and a cross-domain evaluation on a Corpus of Economic News. An additional corpus of Wikipedia articles, InfiKorp, is used in feature selection. Finally, we evaluate three configurations of the proposed modifications. The top configuration improved the final F-measure from 94.53% to 95.65% for the single-domain evaluation and from 70.86% to 79.63% for the cross-domain evaluation. © 2012 Springer-Verlag.
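
The results above are reported as F-measure, the harmonic mean of precision and recall. As a minimal sketch of how such a score can be computed for recognized names, the Python below assumes exact-match scoring over annotation spans; the abstract does not state the matching criterion used in the paper, and the `Span` layout, the `f_measure` helper, and the sample annotations are hypothetical illustrations rather than the authors' evaluation code.

```python
from typing import Set, Tuple

# Hypothetical span layout: (start token index, end token index, category).
Span = Tuple[int, int, str]


def f_measure(gold: Set[Span], predicted: Set[Span]) -> float:
    """Balanced F-measure under exact span-and-category matching (an assumption)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up annotations for illustration only; not data from the paper.
gold = {(0, 1, "first_name"), (1, 2, "surname"), (5, 6, "city")}
predicted = {(0, 1, "first_name"), (1, 2, "surname"), (7, 8, "country")}
score = f_measure(gold, predicted)
```

In this toy case two of the three predicted spans match two of the three gold spans, so precision and recall are both 2/3 and the F-measure equals 2/3 as well.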

Cite

Marcińczuk, M., & Janicki, M. (2012). Optimizing CRF-based model for proper name recognition in Polish texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7181 LNCS, pp. 258–269). https://doi.org/10.1007/978-3-642-28604-9_22
