We propose a system for automatically detecting duplicate entries in a repository of semi-structured text documents. The system uses text-entity recognition to extract times, locations, names of persons and organizations, and events described in the document content. From these structured representations of the content, called “metamodels”, we group the entries into clusters based on content similarity. We then apply machine-learning algorithms to the clusters to detect duplicates. We report the precision, recall, and F-value of the proposed system.
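The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Metamodel` fields, the Jaccard similarity, the greedy threshold clustering, and every function name are assumptions made for illustration, and the abstract does not specify which similarity measure or learning algorithms the system uses. Entity extraction is assumed to have already happened, so the metamodels are built by hand.

```python
from dataclasses import dataclass

FIELDS = ("times", "locations", "persons", "organizations", "events")


@dataclass(frozen=True)
class Metamodel:
    """Hypothetical structured view of one document (field names are assumed)."""
    doc_id: str
    times: frozenset = frozenset()
    locations: frozenset = frozenset()
    persons: frozenset = frozenset()
    organizations: frozenset = frozenset()
    events: frozenset = frozenset()

    def entities(self):
        # Tag each value with its field so the same string in two
        # different fields is treated as two different entities.
        return frozenset(
            (name, value) for name in FIELDS for value in getattr(self, name)
        )


def jaccard(a, b):
    """Jaccard similarity of two metamodels' entity sets (assumed measure)."""
    ea, eb = a.entities(), b.entities()
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0


def cluster(models, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for model in models:
        for members in clusters:
            if jaccard(model, members[0]) >= threshold:
                members.append(model)
                break
        else:
            clusters.append([model])
    return clusters


def precision_recall_f(predicted, actual):
    """Precision, recall, and balanced F-value over sets of duplicate pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value
```

In this sketch a fixed similarity threshold stands in for the machine-learning step; a trained classifier applied to the document pairs inside each cluster would take its place in a fuller version.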
CITATION STYLE
Cordero Cruz, J. A., Garza, S. E., & Schaeffer, S. E. (2014). Entity recognition for duplicate filtering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8821, pp. 253–264). Springer Verlag. https://doi.org/10.1007/978-3-319-11988-5_24