Detecting and dismantling composite visualizations in the scientific literature

Po Shen Lee; Bill Howe

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2015) 9493 247-266

DOI: 10.1007/978-3-319-27677-9_16

3 Citations

17 Readers

Abstract

We are analyzing the visualizations in the scientific literature to enhance search services, detect plagiarism, and study bibliometrics. An immediate problem is the ubiquitous use of multi-part figures: single images with multiple embedded sub-visualizations. Such figures account for approximately 35% of the figures in the scientific literature. Conventional image segmentation techniques and other existing approaches have been shown to be ineffective for parsing visualizations. We propose an algorithm to automatically recognize multi-chart visualizations and segment them into a set of single-chart visualizations, thereby enabling downstream analysis. Our approach first splits an image into fragments based on background color and layout patterns. An SVM-based binary classifier then distinguishes complete charts from auxiliary fragments such as labels, ticks, and legends, achieving an average accuracy of 98.1%. Next, we recursively merge fragments to reconstruct complete visualizations. Finally, a scoring function is used to choose between alternative merge trees. For multi-chart figure detection, we use the output of the splitting algorithm as image features to train a classifier. This avoids the unnecessary cost of running the complete algorithm merely to determine whether a figure is a multi-chart visualization. To evaluate our approach, we randomly collected 880 single-chart scientific figures and 1067 multi-chart scientific figures from the PubMed database. For detection, we achieve 90.2% accuracy via 10-fold cross-validation on the entire corpus. To evaluate the decomposition algorithm, we randomly extracted 261 multi-chart figures as a test set. Our algorithm achieves 80% recall and 85% precision of perfect extractions for the common case of eight or fewer sub-figures per figure. Further, even imperfect extractions are shown to be sufficient for most chart classification and reasoning tasks associated with bibliometrics and academic search applications.

Cite

Lee, P. S., & Howe, B. (2015). Detecting and dismantling composite visualizations in the scientific literature. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9493, pp. 247–266). Springer Verlag. https://doi.org/10.1007/978-3-319-27677-9_16
