DIADEM: Thousands of Websites to a Single Database

  • Furche T
  • Gottlob G
  • Grasso G
  • et al.

Abstract

The web is overflowing with implicitly structured data, spread over hundreds of thousands of sites, hidden deep behind search forms, or siloed in marketplaces, only accessible as HTML. Automatic extraction of structured data at the scale of thousands of websites has long proven elusive, despite its central role in the “web of data”. Through an extensive evaluation spanning over 10,000 websites from multiple application domains, we show that automatic, yet accurate full-site extraction is no longer a distant dream. DIADEM is the first automatic full-site extraction system that is able to extract structured data from different domains at very high accuracy. It combines automated exploration of websites, identification of relevant data, and induction of exhaustive wrappers. Automating these components is the first challenge. DIADEM overcomes this challenge by combining phenomenological and ontological knowledge. Integrating these components is the second challenge. DIADEM overcomes this challenge through a self-adaptive network of relational transducers that produces effective wrappers for a wide variety of websites. Our extensive and publicly available evaluation shows that, for more than 90% of sites from three domains, DIADEM obtains an effective wrapper that extracts all relevant data with 97% average precision. DIADEM also tolerates noisy entity recognisers, and its components individually outperform comparable approaches.
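
For readers unfamiliar with the terms "wrapper" and "precision" used in the abstract, the following is a minimal, hand-written sketch and not DIADEM's implementation. It assumes a wrapper can be modelled as one path that selects records plus one path per attribute, and that precision and recall are measured on whole records against a gold standard. The sample page, the class names, the `WRAPPER` layout, and the helpers `apply_wrapper` and `precision_recall` are all hypothetical; DIADEM induces its wrappers automatically, whereas this one is fixed by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical result page (assumption); kept well-formed so the
# standard-library XML parser can read it.
PAGE = """
<html><body>
  <div class="listing"><span class="price">250000</span><span class="loc">Oxford</span></div>
  <div class="listing"><span class="price">310000</span><span class="loc">Leeds</span></div>
  <div class="ad"><span class="price">9.99</span></div>
</body></html>
"""

# Toy wrapper (assumption): one path selecting records, one path per attribute.
WRAPPER = {
    "record": ".//div[@class='listing']",
    "fields": {
        "price": "span[@class='price']",
        "location": "span[@class='loc']",
    },
}


def apply_wrapper(page, wrapper):
    """Apply the extraction rules to a page and return a list of records."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(wrapper["record"]):
        record = {}
        for name, path in wrapper["fields"].items():
            hit = node.find(path)
            if hit is not None and hit.text:
                record[name] = hit.text.strip()
        records.append(record)
    return records


def precision_recall(extracted, gold):
    """Record-level precision and recall against a gold standard."""
    ext = {tuple(sorted(r.items())) for r in extracted}
    gld = {tuple(sorted(r.items())) for r in gold}
    correct = len(ext & gld)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(gld) if gld else 0.0
    return precision, recall
```

Calling `apply_wrapper(PAGE, WRAPPER)` yields the two listing records and skips the advertisement block, because the record path only matches elements with class `listing`. Comparing those records with a hand-labelled gold set through `precision_recall` gives the kind of per-site figures that an evaluation such as the one described above would aggregate.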

Author-supplied keywords

  • data extraction
  • deep web
  • wrapper induction

Authors

  • Tim Furche

  • Georg Gottlob

  • Giovanni Grasso

  • Xiaonan Guo

  • Giorgio Orsi

  • Christian Schallhart
