Abstract
We address semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which exhibit data sparsity and a high degree of heterogeneity, even within each individual source. We assume each source provides a collection of entity specifications, i.e., entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering the individual attribute name-value pairs that represent the same semantic concept. We exploit the redundancy of information across such sources and propose the iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks shows that RaF-STD outperforms alternative approaches adapted from the literature.
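To make the problem setting concrete, the following minimal Python sketch models entity specifications as sets of attribute name-value pairs and groups attributes from different sources whose value sets overlap. It is only an illustration of the task, not the RaF-STD algorithm: the Jaccard-overlap test stands in for the Bayesian analysis the abstract mentions, and the source names, attribute names, values, threshold, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy input: source -> list of entity specifications,
# each specification being a dict of attribute name-value pairs.
specs = {
    "shopA": [
        {"brand": "Canon", "resolution": "20 MP"},
        {"brand": "Nikon", "resolution": "24 MP"},
    ],
    "shopB": [
        {"manufacturer": "Canon", "megapixels": "20 MP"},
        {"manufacturer": "Nikon", "megapixels": "24 MP"},
    ],
}


def attribute_values(specs):
    """Map each (source, attribute name) to the set of values it takes."""
    values = {}
    for source, entities in specs.items():
        for entity in entities:
            for name, value in entity.items():
                values.setdefault((source, name), set()).add(value.lower())
    return values


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_attributes(specs, threshold=0.5):
    """Group attributes from different sources with overlapping value sets.

    Assumption: simple value overlap is used as a stand-in matching signal;
    the paper's Bayesian model, tagging step, and name/domain matching are
    not reproduced here.
    """
    values = attribute_values(specs)
    parent = {attr: attr for attr in values}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(values, 2):
        # Only compare attributes coming from different sources.
        if a[0] != b[0] and jaccard(values[a], values[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for attr in values:
        clusters.setdefault(find(attr), set()).add(attr)
    return sorted(sorted(cluster) for cluster in clusters.values())


if __name__ == "__main__":
    for cluster in cluster_attributes(specs):
        print(cluster)
```

On this toy input the sketch places `brand` with `manufacturer` and `resolution` with `megapixels`. Real Web sources are sparser and internally heterogeneous, which is the difficulty the three steps of RaF-STD are designed to address.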
Cite
Piai, F., Atzeni, P., Merialdo, P., & Srivastava, D. (2023). Fine-grained semantic type discovery for heterogeneous sources using clustering. VLDB Journal, 32(2), 305–324. https://doi.org/10.1007/s00778-022-00743-3