Retrieving information about highly ambiguous gene/protein homonyms is a challenge, particularly where their non-protein meanings are more frequent than their protein meaning (e.g., SAH or HF). Because such cases are poorly covered in common benchmarking data sets, the performance of existing gene/protein recognition tools on them is hard to assess. We uniformly sample a corpus of eight ambiguous gene/protein abbreviations from MEDLINE and provide manual annotations for each mention of these abbreviations. Based on this resource, we show that available gene recognition tools, such as conditional random fields (CRF) trained on BioCreative 2 NER data or GNAT, tend to underperform on this phenomenon. We propose to extend existing gene recognition approaches by combining a CRF and a support vector machine. In a cross-entity evaluation and without taking any entity-specific information into account, our model achieves a gain of 6 points in F1 measure over our best baseline, which checks for the occurrence of a long form of the abbreviation, and more than 9 points over all existing tools investigated.
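The long-form baseline mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the baseline is implemented, so everything here is an assumption: the function name `long_form_baseline`, the case-insensitive substring match, the whole-text scope, and the made-up abbreviation and long form used in the example.

```python
import re


def long_form_baseline(text, abbreviation, long_forms):
    """Label each mention of `abbreviation` in `text`.

    Assumed rule (not taken from the paper): a mention counts as a
    gene/protein mention if any known long form of the abbreviation
    occurs anywhere in the same text.
    Returns a list of (character offset, is_gene) pairs.
    """
    lowered = text.lower()
    has_long_form = any(lf.lower() in lowered for lf in long_forms)
    # Match the abbreviation only as a whole token.
    pattern = re.compile(r"\b" + re.escape(abbreviation) + r"\b")
    return [(m.start(), has_long_form) for m in pattern.finditer(text)]


# "ABC" and "alpha beta kinase" are hypothetical placeholders, not real data.
with_long_form = long_form_baseline(
    "ABC (alpha beta kinase) binds DNA. ABC is induced.",
    "ABC",
    ["alpha beta kinase"],
)
without_long_form = long_form_baseline(
    "ABC was elevated in patients.",
    "ABC",
    ["alpha beta kinase"],
)
```

A rule like this can only accept mentions in texts that spell out the long form, which is one plausible reason a learned filter could improve on it.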
CITATION STYLE
Hartung, M., Klinger, R., Zwick, M., & Cimiano, P. (2014). Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 118–127). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-3418