Natural Key Discovery in Wikipedia Tables

Abstract

Wikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time. To address this challenge, we formally define the notion of natural keys and propose a supervised learning approach to automatically detect natural keys in Wikipedia tables using carefully engineered features. Our solution includes novel features that extract information from time (a table's version history) and space (other similar tables). On a curated dataset of 1,000 Wikipedia table histories, our model achieves 80% F-measure, which is at least 20% more than all related approaches. We use our model to discover natural keys in the entire corpus of Wikipedia tables and provide the dataset to the community to facilitate future research.
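To make the notion of a key candidate concrete, the following is a minimal sketch of a uniqueness check over a small table. It is not the supervised model described in the abstract: uniqueness within a single table snapshot is only a necessary condition for a natural key, and the paper's features additionally draw on version history and similar tables. The function name unique_column_sets, the max_size parameter, and the toy table are assumptions made for illustration only.

```python
from itertools import combinations


def unique_column_sets(header, rows, max_size=2):
    """Return minimal column combinations whose values are unique across all rows.

    Hypothetical helper for illustration; it tests uniqueness only and does
    not decide whether a combination is a natural key.
    """
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(header)), size):
            # Skip supersets of a combination already known to be unique.
            if any(set(prev) <= set(combo) for prev in found):
                continue
            # Project every row onto the candidate columns and count distinct values.
            seen = {tuple(row[i] for i in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return [tuple(header[i] for i in combo) for combo in found]


# Toy table (illustrative data, not taken from the paper's dataset).
header = ["Year", "Host city", "Country"]
rows = [
    ["2008", "Beijing", "China"],
    ["2012", "London", "United Kingdom"],
    ["2016", "Rio de Janeiro", "Brazil"],
    ["2022", "Beijing", "China"],
]

print(unique_column_sets(header, rows))
```

In this toy table only the Year column is unique, so it is the single candidate reported. A check of this kind can at most narrow down candidates; telling a natural key apart from an accidentally unique column is the part the abstract assigns to the learned model.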

Cite

Bornemann, L., Bleifuß, T., Kalashnikov, D. V., Naumann, F., & Srivastava, D. (2020). Natural Key Discovery in Wikipedia Tables. In The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020 (pp. 2789–2795). Association for Computing Machinery, Inc. https://doi.org/10.1145/3366423.3380039
