A bag of paths model for measuring structural similarity in Web documents


Abstract

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computing structural similarity between documents with the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children and siblings, it allows us to define a new family of meaningful (and at the same time computationally simple) structural similarity measures. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages. Copyright 2003 ACM.
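
The idea of representing a document by the paths in its tree can be illustrated with a minimal sketch. The abstract does not specify which paths are collected or which similarity measures are used, so everything below is an assumption made for illustration: the helper names `bag_of_paths` and `path_similarity` are hypothetical, paths are taken to be root-to-node tag sequences, and the measure is a weighted Jaccard overlap of the two path multisets rather than any measure defined in the paper. The parser requires well-formed markup, which real-world HTML often is not.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def bag_of_paths(xml_text):
    """Return a multiset of root-to-node tag paths (assumed path definition)."""
    root = ET.fromstring(xml_text)
    bag = Counter()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        bag[path] += 1  # count every occurrence of this path
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return bag


def path_similarity(bag_a, bag_b):
    """Weighted Jaccard overlap of two path multisets (an assumed measure,
    not necessarily one of the measures proposed in the paper)."""
    keys = set(bag_a) | set(bag_b)
    shared = sum(min(bag_a[k], bag_b[k]) for k in keys)
    total = sum(max(bag_a[k], bag_b[k]) for k in keys)
    return shared / total if total else 1.0


# Two toy, well-formed pages used only for illustration.
page_a = "<html><body><div><p/><p/></div></body></html>"
page_b = "<html><body><div><p/></div><table><tr/></table></body></html>"
sim = path_similarity(bag_of_paths(page_a), bag_of_paths(page_b))
```

Under these assumptions, comparing two documents costs time linear in the number of distinct paths, since it only needs the two bags and no tree alignment. For the toy pages above, four of the seven path occurrences are shared, so `sim` is 4/7.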

Cite

Joshi, S., Agrawal, N., Krishnapuram, R., & Negi, S. (2003). A bag of paths model for measuring structural similarity in Web documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 577–582). https://doi.org/10.1145/956750.956822
