A bag of paths model for measuring structural similarity in Web documents

Sachindra Joshi; Neeraj Agrawal; Raghu Krishnapuram; Sumit Negi

Conference Proceedings

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003) 577-582

DOI: 10.1145/956750.956822

68 Citations

27 Readers

Abstract

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computing structural similarity between documents based on the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children, and siblings, it allows us to define a new family of meaningful (and at the same time computationally simple) structural similarity measures. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages. Copyright 2003 ACM.
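The abstract does not spell out the path definition or the similarity measures, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a "path" is the sequence of tag names from the root to each node, and it swaps in a weighted Jaccard score (sum of minimum counts over sum of maximum counts) as one plausible member of such a family of measures. The names `bag_of_paths` and `path_similarity` are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def bag_of_paths(xml_text: str) -> Counter:
    """Count every root-to-node tag path in the parsed tree.

    Assumption: a path is the '/'-joined tag names from the root
    down to a node; the paper may define paths differently.
    """
    root = ET.fromstring(xml_text)
    bag = Counter()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        bag[path] += 1
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return bag


def path_similarity(a: Counter, b: Counter) -> float:
    """Weighted Jaccard over path counts (an illustrative choice)."""
    keys = set(a) | set(b)
    denom = sum(max(a[k], b[k]) for k in keys)
    if denom == 0:
        return 0.0
    return sum(min(a[k], b[k]) for k in keys) / denom


# Two small well-formed documents with overlapping structure.
doc1 = "<html><body><div><p/><p/></div></body></html>"
doc2 = "<html><body><div><p/></div><table/></body></html>"

bag1 = bag_of_paths(doc1)
bag2 = bag_of_paths(doc2)
print(path_similarity(bag1, bag2))
```

Building each bag takes one pass over the tree, and comparing two bags is linear in the number of distinct paths, which is the kind of saving over tree-based comparison that the abstract points to. Real HTML is often not well-formed XML, so a tolerant HTML parser would be needed in place of `xml.etree.ElementTree` for pages like those in the experiments.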

Citation (APA)

Joshi, S., Agrawal, N., Krishnapuram, R., & Negi, S. (2003). A bag of paths model for measuring structural similarity in Web documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 577–582). https://doi.org/10.1145/956750.956822
