Fine-grained semantic type discovery for heterogeneous sources using clustering

Abstract

We focus on the key task of semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which present sparsity of data and a high degree of heterogeneity, even internally within each individual source. We assume each source provides a collection of entity specifications, i.e. entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering individual attribute name-value pairs that represent the same semantic concept. We take advantage of the opportunities arising from the redundancy of information across such sources and propose the iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks demonstrates the superiority of RaF-STD over alternative approaches adapted from the literature.
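To make the problem setting concrete, the following minimal sketch models entity specifications as sets of attribute name-value pairs and groups attributes from different sources whose value domains overlap. It is not RaF-STD and does not reproduce the Bayesian analysis or the tagging step; it only illustrates the general idea of domain-based attribute matching mentioned in step (iii). The sources `shopA` and `shopB`, the toy data, the Jaccard measure, and the `threshold` parameter are all assumptions introduced for illustration.

```python
from itertools import combinations

# Hypothetical toy input: each entity specification is (source, {attribute name: value}).
SPECS = [
    ("shopA", {"brand": "Canon", "resolution": "20 MP"}),
    ("shopA", {"brand": "Nikon", "resolution": "24 MP"}),
    ("shopB", {"manufacturer": "Canon", "megapixels": "20 MP"}),
    ("shopB", {"manufacturer": "Nikon", "megapixels": "24 MP"}),
]


def attribute_domains(specs):
    """Collect the set of observed values for each (source, attribute name)."""
    domains = {}
    for source, spec in specs:
        for name, value in spec.items():
            domains.setdefault((source, name), set()).add(value.lower())
    return domains


def jaccard(a, b):
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_attributes(specs, threshold=0.5):
    """Group attributes from different sources whose value domains overlap.

    Assumption: a fixed Jaccard threshold stands in for a real matching model.
    """
    domains = attribute_domains(specs)
    parent = {key: key for key in domains}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(domains, 2):
        # Only compare attributes that come from different sources.
        if a[0] != b[0] and jaccard(domains[a], domains[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for key in domains:
        clusters.setdefault(find(key), set()).add(key)
    return list(clusters.values())


clusters = cluster_attributes(SPECS)
```

On this toy input, `brand` from `shopA` is grouped with `manufacturer` from `shopB`, and `resolution` with `megapixels`, because their value sets coincide. Real Web sources are sparse and internally heterogeneous, which is exactly the case where such a naive overlap test falls short and the approach described in the abstract is intended to help.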

Cite

Piai, F., Atzeni, P., Merialdo, P., & Srivastava, D. (2023). Fine-grained semantic type discovery for heterogeneous sources using clustering. VLDB Journal, 32(2), 305–324. https://doi.org/10.1007/s00778-022-00743-3
