A simhash-based generalized framework for citation matching in MapReduce

Pengsen Wang; Bin Wu; Xiaoming Li; Lin Wang; Bai Wang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2015) 9441 78-90

DOI: 10.1007/978-3-319-25660-3_7

Abstract

Citation matching is the task of identifying the paper that a citation refers to, given only the small amount of information the citation contains. Most existing solutions rely on computationally expensive models to achieve good results, and may not be efficient enough for large digital libraries holding millions of citations. To address this problem, we propose a simhash-based generalized framework for citation matching in MapReduce. The framework matches citations using exact title matching and distance-based short-text similarity metrics. To improve accuracy, it supports customizable citation fields, citation field weights, and word segmentation weights. We also design a heuristic algorithm that automatically calculates the weight of each citation field. To handle large-scale datasets, we implement the framework on Hadoop, a popular parallel computation platform. We run experiments on real datasets from a Chinese Medicine Digital Library, and a comparative experiment on the Cora corpus (McCallum’s citation matching test set). The experimental results confirm the efficiency and effectiveness of our framework.
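
The abstract does not give implementation details, so the following is only a minimal single-machine sketch of the general idea: exact title matching with a fallback to a field-weighted simhash compared by Hamming distance. The field names, the weights in `FIELD_WEIGHTS`, the `max_distance` threshold, and all function names are hypothetical illustrations, not the authors' values or code, and the MapReduce/Hadoop layer and the heuristic weight calculation are not shown.

```python
import hashlib
import re

# Hypothetical field weights for illustration only; the paper computes
# field weights with a heuristic algorithm that is not reproduced here.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "venue": 1.0, "year": 1.0}


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of a token, truncated to `bits` bits (bits <= 128)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_tokens, bits: int = 64) -> int:
    """Standard simhash: each token votes +weight or -weight on every bit."""
    counts = [0.0] * bits
    for token, weight in weighted_tokens:
        h = token_hash(token, bits)
        for i in range(bits):
            counts[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for exact comparison."""
    return " ".join(re.findall(r"\w+", text.lower()))


def citation_tokens(citation: dict, field_weights=FIELD_WEIGHTS):
    """Split each configured field into words, tagged with the field weight."""
    tokens = []
    for field, weight in field_weights.items():
        for word in normalize(citation.get(field, "")).split():
            tokens.append((word, weight))
    return tokens


def is_match(a: dict, b: dict, max_distance: int = 3) -> bool:
    """Exact title match first; otherwise compare simhash fingerprints."""
    title_a = normalize(a.get("title", ""))
    title_b = normalize(b.get("title", ""))
    if title_a and title_a == title_b:
        return True
    fp_a = simhash(citation_tokens(a))
    fp_b = simhash(citation_tokens(b))
    return hamming_distance(fp_a, fp_b) <= max_distance
```

In a MapReduce setting, a fingerprint like this would typically be computed per citation in the map phase and candidate pairs compared in the reduce phase; how the paper partitions that work is not stated in the abstract.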

Wang, P., Wu, B., Li, X., Wang, L., & Wang, B. (2015). A simhash-based generalized framework for citation matching in MapReduce. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9441, pp. 78–90). Springer Verlag. https://doi.org/10.1007/978-3-319-25660-3_7