The Effect of Corpora Size on Performance of Named Entity Recognition

Abstract

The amount of online text available is continuously growing and has reached hundreds of billions of words. Much research has used this data to improve results on a range of problems. Yet algorithms are continuously optimized, tested, and compared after training on corpora of only one million words or fewer. Most research focuses on the accuracy of the results these algorithms produce, often overlooking their running time or the cost of running them. The main goal of this paper is to show the effect that large data has on the running time and performance of Natural Language Processing algorithms. To this end, three Named Entity Recognition (NER) tools were selected. We evaluated the trade-off between quality and running time, and the effect of increasing data size on performance, across a varied selection of tools in the NER domain. The results show that the existing tools are unable to cope with increasing data size. Moreover, as data size increases, quality improves but performance degrades, rendering the existing tools inefficient. After optimization, these tools can process large data sizes; unfortunately, latency remains high.

Cite

Liaghat, Z. (2018). The Effect of Corpora Size on Performance of Named Entity Recognition. In Studies in Big Data (Vol. 27, pp. 93–105). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-60255-4_8
