Clustering Structured web sources: A schema-based, model-differentiation approach

Bin He; Tao Tao; Kevin Chen Chuan Chang

Journal Article


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2004) 3268 536-546

DOI: 10.1007/978-3-540-30192-9_53

18 Citations

10 Readers


Abstract

The Web has been rapidly "deepened" with the prevalence of databases online. On this "deep Web," numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources with similar query capabilities. Viewed abstractly, this problem is essentially one of clustering categorical data (each query schema is treated as a transaction). Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering the Web query schemas, the model-differentiation function outperforms existing objective functions when used with the hierarchical agglomerative clustering algorithm. © Springer-Verlag 2004.
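The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each query schema is a non-empty set of attribute names (a "transaction"), and it assumes a Pearson chi-square homogeneity statistic as the hypothesis test; the abstract does not say which test the paper uses. The function names `chi2_stat` and `cluster` and the sample schemas are hypothetical. Hierarchical agglomerative clustering repeatedly merges the pair of clusters that looks most homogeneous, so the clusters that remain are the ones the test separates most strongly.

```python
from collections import Counter
from itertools import combinations


def chi2_stat(a: Counter, b: Counter) -> float:
    """Pearson chi-square homogeneity statistic between two attribute-count
    vectors. This choice of test is an assumption made for illustration."""
    na, nb = sum(a.values()), sum(b.values())
    total = na + nb
    stat = 0.0
    for attr in set(a) | set(b):
        col = a[attr] + b[attr]  # total occurrences of this attribute
        for obs, n in ((a[attr], na), (b[attr], nb)):
            exp = col * n / total  # expected count if both share one model
            stat += (obs - exp) ** 2 / exp
    return stat


def cluster(schemas, k):
    """Agglomerative clustering of schemas into k clusters.

    Each cluster is a (member indices, attribute counts) pair. At every step
    the two clusters with the smallest statistic (least evidence of different
    generative models) are merged.
    """
    clusters = [([i], Counter(s)) for i, s in enumerate(schemas)]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: chi2_stat(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


# Hypothetical query schemas: two book sources and two airfare sources.
schemas = [
    {"title", "author", "isbn"},
    {"title", "author", "publisher"},
    {"from", "to", "departure date"},
    {"from", "to", "passengers"},
]
result = cluster(schemas, k=2)
print(result)
```

The raw statistic grows with cluster size, so a fuller treatment would likely normalize it or compare it against a significance threshold; that refinement is left out of this sketch.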

Cite


He, B., Tao, T., & Chang, K. C. C. (2004). Clustering Structured web sources: A schema-based, model-differentiation approach. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3268, 536–546. https://doi.org/10.1007/978-3-540-30192-9_53
