An Analysis of Biomedical Tokenization: Problems and Strategies

Noa P. Cruz Díaz; Manuel M. Maña López

Conference Proceedings (Open Access)

EMNLP 2015 - 6th International Workshop on Health Text Mining and Information Analysis, LOUHI 2015 - Proceedings of the Workshop (2015) 40-49

DOI: 10.18653/v1/w15-2605

10 Citations

106 Readers

Abstract

Choosing the right tokenizer is a non-trivial task, especially in the biomedical domain, where it poses additional challenges that, if not resolved, propagate errors to the successive stages of the Natural Language Processing analysis pipeline. This paper aims to identify these problematic cases and analyze the output that a representative and widely used set of tokenizers produces on them. This work will aid the decision-making process of choosing the right strategy according to the downstream application. In addition, it will help developers to create accurate tokenization tools or improve existing ones. A total of 14 problematic cases were described, with biomedical samples shown for each of them. The outputs of 12 tokenizers were provided and discussed in relation to the level of agreement among tools.
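To make the idea of comparing tokenizer outputs concrete, the following minimal Python sketch runs three toy strategies on one sample and measures how often they agree. Everything in it is an illustrative assumption: the three regex-based tokenizers, the sample sentence, and the `pairwise_agreement` measure are hypothetical and are not the 12 tools, the 14 cases, or the agreement analysis used in the paper.

```python
import re
from itertools import combinations

# Three hypothetical tokenization strategies (assumptions for illustration,
# not the tools evaluated in the paper).

def whitespace_tokenize(text):
    # Split on whitespace only; punctuation stays attached to words.
    return text.split()

def punct_tokenize(text):
    # Split every non-word, non-space character into its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def hyphen_tokenize(text):
    # Keep hyphenated terms together, split other punctuation.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

TOKENIZERS = {
    "whitespace": whitespace_tokenize,
    "punct": punct_tokenize,
    "hyphen": hyphen_tokenize,
}

def tokenize_all(text):
    # Run every strategy on the same input.
    return {name: fn(text) for name, fn in TOKENIZERS.items()}

def pairwise_agreement(outputs):
    # Fraction of tokenizer pairs that produced identical token lists.
    pairs = list(combinations(outputs.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    sample = "IL-2 activates NF-kB."  # made-up biomedical-style sample
    outputs = tokenize_all(sample)
    for name, tokens in outputs.items():
        print(f"{name:>10}: {tokens}")
    print("pairwise agreement:", pairwise_agreement(outputs))
```

On the made-up sample, the hyphenated names and the trailing period are split differently by each strategy, so no pair agrees, whereas a plain phrase with no punctuation yields full agreement. A real comparison would substitute actual tokenization tools and a token-level or boundary-level agreement measure suited to the downstream application.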

Cite

Cruz Díaz, N. P., & Maña López, M. M. (2015). An Analysis of Biomedical Tokenization: Problems and Strategies. In EMNLP 2015 - 6th International Workshop on Health Text Mining and Information Analysis, LOUHI 2015 - Proceedings of the Workshop (pp. 40–49). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w15-2605
