A simhash-based generalized framework for citation matching in MapReduce

Abstract

Citation matching is the task of finding the cited papers given only a small amount of information. Several approaches to citation matching exist, but most require expensive model processing to achieve good results, and they may not be efficient enough when dealing with millions of citations in large digital libraries. To address this problem, we propose a simhash-based generalized framework in MapReduce for citation matching. The framework uses exact title matching and distance-based short-text similarity metrics to match citations. To improve accuracy, it also supports customizable citation fields, citation field weights, and word segmentation weights. We also design a heuristic algorithm that automatically calculates the weight of each citation field. To handle large-scale datasets, we implement the framework in Hadoop, a popular parallel computation platform. We run experiments on real datasets from a Chinese Medicine Digital Library and a comparative experiment on the Cora corpus (McCallum’s citation matching test set). The experimental results confirm the efficiency and effectiveness of our framework.
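The abstract does not spell out how fingerprints are computed, so the following is only a minimal sketch of the general simhash technique with per-field weights, not the authors' implementation. The field names, the weight values in `FIELD_WEIGHTS`, the MD5-based token hash, and the whitespace-style tokenizer are all assumptions made for illustration; the paper's heuristic weight calculation, its word segmentation weights, and its MapReduce job layout are not reproduced here.

```python
import hashlib
import re

BITS = 64  # fingerprint width (an assumption; the abstract does not state it)

# Hypothetical field weights, chosen only for illustration.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "venue": 1.0, "year": 1.0}


def token_hash(token: str, bits: int = BITS) -> int:
    """Stable hash of a token, truncated to `bits` bits (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_tokens, bits: int = BITS) -> int:
    """Generic simhash: each token casts a weighted vote on every bit position."""
    votes = [0.0] * bits
    for token, weight in weighted_tokens:
        h = token_hash(token, bits)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def citation_tokens(citation: dict):
    """Yield (token, weight) pairs; each token inherits the weight of its field."""
    for field, weight in FIELD_WEIGHTS.items():
        for token in re.findall(r"\w+", citation.get(field, "").lower()):
            yield token, weight


def citation_fingerprint(citation: dict) -> int:
    """Simhash fingerprint of a citation record given as a field-to-text dict."""
    return simhash(citation_tokens(citation))
```

Two citations that are identical after lowercasing and tokenization get the same fingerprint, so their Hamming distance is 0. Near-duplicates would be accepted when the distance falls below a threshold, which is a tuning parameter the abstract does not specify.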

Cite

Wang, P., Wu, B., Li, X., Wang, L., & Wang, B. (2015). A simhash-based generalized framework for citation matching in MapReduce. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9441, pp. 78–90). Springer Verlag. https://doi.org/10.1007/978-3-319-25660-3_7
