Visual validation is the process of validating sets of extracted entities by means of visual information. Its main advantage is that it brings visual information into web information extraction without compromising the robustness of extractors. In this paper, we show that unsupervised visual validation can be used to create robust web data extractors. More precisely, we evaluate the performance of visual validation on a corpus of visually heterogeneous documents. The selected extraction task consists of extracting the price, name, description, and SKU of unspecified products from unseen documents. Our corpus, which we make publicly available, contains 1000 products from 100 different sources. Results also show that visual validation improves web data extraction even when the extractor is trained with visual features.
CITATION STYLE
Potvin, B., & Villemaire, R. (2019). Robust Web Data Extraction Based on Unsupervised Visual Validation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11431 LNAI, pp. 77–89). Springer Verlag. https://doi.org/10.1007/978-3-030-14799-0_7