Improving NCD accuracy by combining document segmentation and document distortion


Abstract

Compression distances have been applied to a broad range of domains because they are parameter-free, widely applicable, and highly effective. However, they have a drawback in one particular setting: when two objects of very different sizes are compared, the distance does not treat them as similar, even if one is contained in the other as a substring. This work addresses that issue for the case where compression distances are used to calculate similarities between documents. The proposed approach combines document segmentation and document distortion. Document segmentation is used to tackle the size-mismatch drawback, while document distortion helps compression distances obtain more reliable similarities. The results show that combining both techniques gives better results than applying neither or applying either one alone. These results are consistent across datasets of diverse nature.
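
The size-mismatch drawback described in the abstract can be illustrated with a minimal Python sketch. It uses the standard definition of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib standing in for the compressor C. The segmentation scheme shown (fixed-size, non-overlapping segments, keeping the smallest distance) and the names compressed_size, ncd, and segmented_ncd are assumptions made for this illustration, not the method from the paper. Document distortion is not sketched here.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def segmented_ncd(a: bytes, b: bytes) -> float:
    """Hypothetical segmentation scheme: cut the larger document into
    consecutive segments as long as the smaller one and keep the
    smallest NCD found between the smaller document and any segment."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    size = max(len(small), 1)
    segments = [large[i:i + size] for i in range(0, len(large), size)] or [large]
    return min(ncd(small, seg) for seg in segments)


# Synthetic data: a "large" document of random words and a "small"
# document that is an exact substring of it. The slice is deliberately
# aligned with a segment boundary; real documents would likely need
# overlapping segments or a smarter split.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
         for _ in range(500)]
large = " ".join(rng.choice(vocab) for _ in range(3000)).encode()
small = large[2000:3000]

whole = ncd(small, large)                 # compares the two documents directly
segmented = segmented_ncd(small, large)   # compares against same-sized segments

print(f"NCD on whole documents: {whole:.3f}")
print(f"NCD with segmentation:  {segmented:.3f}")
```

In this sketch the direct comparison is dominated by the compressed size of the larger document, so the distance stays high even though the small document is fully contained in the large one. Comparing against segments of matching size removes that imbalance.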

Cite

Granados, A., Martínez, R., Camacho, D., & Rodríguez, F. de B. (2014). Improving NCD accuracy by combining document segmentation and document distortion. Knowledge and Information Systems, 41(1), 223–245. https://doi.org/10.1007/s10115-013-0664-4
