Optimizing sentence boundary detection for Croatian

Frane Šarić; Jan Šnajder; Bojana Dalbelo Bašić

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7499 LNAI 105-111

DOI: 10.1007/978-3-642-32790-2_12

Abstract

A number of natural language processing tasks depend on segmenting text into sentences. Tools that perform sentence boundary detection achieve excellent performance for some languages. We trained several publicly available language-independent tools to perform sentence boundary detection for Croatian. The initial results show that off-the-shelf methods used for English do not work particularly well for Croatian. After performing error analysis, we propose additional features that help resolve some of the most common boundary detection errors. We use unsupervised methods on a large Croatian corpus to collect likely sentence starters, abbreviations, and honorifics. In addition to some commonly used features, we use these word lists as features for a classifier that is trained on a smaller corpus with manually annotated sentences. The method we propose advances the state-of-the-art accuracy for Croatian sentence boundary detection on news corpora to 99.5%. © 2012 Springer-Verlag.
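
The abstract does not spell out how the word lists are collected or which features are used, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a simple frequency heuristic: a token ending in a period that is mostly followed by a lowercase word is treated as a likely abbreviation, and a capitalised word that mostly appears right after a period is treated as a likely sentence starter. The function names, thresholds (`min_count`, `min_ratio`), and feature names are all hypothetical, and honorifics are left out for brevity.

```python
from collections import Counter


def collect_lists(text, min_count=2, min_ratio=0.6):
    """Collect likely abbreviations and sentence starters from raw text.

    Hypothetical heuristic, not taken from the paper.
    """
    tokens = text.split()
    dot_total = Counter()       # "tok." seen before another token
    dot_lower_next = Counter()  # ... and that next token starts lowercase
    cap_total = Counter()       # capitalised word seen (not text-initial)
    cap_after_dot = Counter()   # ... directly after a token ending in '.'

    for prev, nxt in zip(tokens, tokens[1:]):
        if prev.endswith(".") and len(prev) > 1:
            key = prev.lower()
            dot_total[key] += 1
            if nxt[0].islower():
                dot_lower_next[key] += 1
        if nxt[0].isupper():
            word = nxt.rstrip(".,;:!?")
            cap_total[word] += 1
            if prev.endswith("."):
                cap_after_dot[word] += 1

    abbreviations = {
        t for t, n in dot_total.items()
        if n >= min_count and dot_lower_next[t] / n >= min_ratio
    }
    starters = {
        w for w, n in cap_total.items()
        if n >= min_count and cap_after_dot[w] / n >= min_ratio
    }
    return abbreviations, starters


def boundary_features(tokens, i, abbreviations, starters):
    """Features for the candidate boundary after tokens[i].

    The feature set is an assumption made for illustration only.
    """
    cur = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "cur_is_abbreviation": cur.lower() in abbreviations,
        "next_is_starter": nxt.rstrip(".,;:!?") in starters,
        "next_is_capitalised": nxt[:1].isupper(),
        "cur_length": len(cur),
    }
```

In a setup like the one the abstract describes, `collect_lists` would run over the large unannotated corpus, and the dictionaries returned by `boundary_features` would be fed, together with boundary labels from the smaller annotated corpus, to whatever classifier is being trained.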

Citation (APA)

Šarić, F., Šnajder, J., & Dalbelo Bašić, B. (2012). Optimizing sentence boundary detection for Croatian. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7499 LNAI, pp. 105–111). https://doi.org/10.1007/978-3-642-32790-2_12
