Robust Web Data Extraction Based on Unsupervised Visual Validation

Benoit Potvin; Roger Villemaire

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019) 11431 LNAI 77-89

DOI: 10.1007/978-3-030-14799-0_7

Abstract

Visual validation is the process of validating sets of extracted entities by means of visual information. Its main advantage is that it exploits visual information for web information extraction without compromising the robustness of extractors. In this paper, we show that unsupervised visual validation can be used to create robust web data extractors. More precisely, we evaluate the performance of visual validation on a corpus of visually heterogeneous documents. The selected extraction task consists of extracting the price, name, description, and SKU of unspecified products from unseen documents. Our corpus contains 1000 different products from 100 different sources, and we make it publicly available. Results also show that visual validation improves web data extraction even when the extractor is trained with visual features.

Cite

Potvin, B., & Villemaire, R. (2019). Robust Web Data Extraction Based on Unsupervised Visual Validation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11431 LNAI, pp. 77–89). Springer Verlag. https://doi.org/10.1007/978-3-030-14799-0_7
