Detecting duplicate records spread across different databases is an important prerequisite to integrating protein data successfully. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention is any recorded information about a protein, whether derived from a database, a high-throughput study, literature text mining, or another source. PERF can also be easily extended to deduplicate protein-protein interactions (PPIs). The framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based on the textual similarity of mention attribute values. A prototype implementation of the framework was tested, and these tests indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions. © 2008 Springer-Verlag Berlin Heidelberg.
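To make the idea of scoring mention pairs by the textual similarity of their attributes concrete, the following is a minimal sketch and not PERF's actual likelihood measure, which the abstract does not specify. The attribute names (`name`, `gene`, `organism`), the weights, the `mention_score` function, and the use of Python's `difflib.SequenceMatcher` as the string similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def mention_score(m1: dict, m2: dict, weights: dict) -> float:
    """Weighted average similarity over the attributes present in both mentions.

    The weights are hypothetical; returns 0.0 when the mentions share no attribute.
    """
    shared = [k for k in weights if m1.get(k) and m2.get(k)]
    if not shared:
        return 0.0
    total_weight = sum(weights[k] for k in shared)
    weighted = sum(weights[k] * attribute_similarity(m1[k], m2[k]) for k in shared)
    return weighted / total_weight


# Illustrative mentions already mapped to a common (hypothetical) schema.
WEIGHTS = {"name": 1.0, "gene": 2.0, "organism": 0.5}
mention_a = {"name": "Tumor protein p53", "gene": "TP53", "organism": "Homo sapiens"}
mention_b = {"name": "tumor protein P53", "gene": "tp53", "organism": "Homo sapiens"}
mention_c = {"name": "Insulin", "gene": "INS", "organism": "Homo sapiens"}

print(mention_score(mention_a, mention_b, WEIGHTS))  # likely duplicates score high
print(mention_score(mention_a, mention_c, WEIGHTS))  # distinct proteins score lower
```

In a sketch like this, a pair would be flagged as a duplicate when its score exceeds a chosen threshold. How PERF sets or learns such a threshold is not stated in the abstract.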
CITATION STYLE
Lochovsky, L., & Topaloglou, T. (2008). An entity resolution framework for deduplicating proteins. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5109 LNBI, pp. 92–107). https://doi.org/10.1007/978-3-540-69828-9_9