Improving recall of regular expressions for information extraction

Karin Murthy; P. Deepak; Prasad M. Deshpande

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7651 LNCS 455-467

DOI: 10.1007/978-3-642-35063-4_33

17 Citations

8 Readers

Abstract

Learning or writing regular expressions to identify instances of a specific concept within text documents with high precision and recall is challenging. It is relatively easy to improve the precision of an initial regular expression by identifying the false positives it matches and tweaking the expression to exclude them. However, modifying the expression to improve recall is difficult: in the absence of tools to identify missing instances, false negatives can be found only by manually analyzing all documents. We focus on partially automating the discovery of missing instances by soliciting minimal user feedback. We present a technique to identify good generalizations of a regular expression that have improved recall while retaining high precision. We empirically demonstrate the effectiveness of the proposed technique as compared to existing methods and show results for a variety of tasks such as identification of dates, phone numbers, product names, and course numbers on real-world datasets. © 2012 Springer-Verlag.
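
The precision/recall tension described in the abstract can be illustrated with a small sketch. This is not the authors' technique, which the abstract does not specify; the corpus, the labeled dates, both patterns, and the helper `precision_recall` are invented here purely for illustration. It compares a strict date pattern against a hand-written generalization and scores each against a labeled set.

```python
import re


def precision_recall(pattern, text, gold):
    """Score the distinct strings matched by `pattern` against the labeled set `gold`."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    true_pos = found & gold
    precision = len(true_pos) / len(found) if found else 0.0
    recall = len(true_pos) / len(gold) if gold else 0.0
    return precision, recall


# Toy corpus and hand-labeled dates (both hypothetical).
text = "Due 12/05/2011, moved to 3/7/2012; invoice 44/2012 sent 01-02-2012."
gold = {"12/05/2011", "3/7/2012", "01-02-2012"}

# Strict pattern: two-digit day and month, slash separators only.
initial = r"\b\d{2}/\d{2}/\d{4}\b"
# Hand-written generalization: one- or two-digit fields, slash or hyphen.
generalized = r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"

p0, r0 = precision_recall(initial, text, gold)
p1, r1 = precision_recall(generalized, text, gold)
print(f"initial:     precision={p0:.2f} recall={r0:.2f}")
print(f"generalized: precision={p1:.2f} recall={r1:.2f}")
```

On this toy input the strict pattern finds only one of the three labeled dates, while the generalization finds all three and still skips the non-date `44/2012`. Computing recall here relies on the labeled set `gold`, which is exactly what is missing in practice and what the paper proposes to discover with minimal user feedback.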

Citation (APA)

Murthy, K., Deepak, P., & Deshpande, P. M. (2012). Improving recall of regular expressions for information extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7651 LNCS, pp. 455–467). https://doi.org/10.1007/978-3-642-35063-4_33
