CoVA: Context-aware Visual Attention for Webpage Information Extraction

Abstract

Webpage information extraction (WIE) is an important step in creating knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges, as context and appearance are encoded in an abstract manner. To address this challenge, we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach, we collect a new large-scale dataset of e-commerce websites for which we manually annotate every web element with four labels: product price, product title, product image and others. On this dataset, we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.

Cite

Kumar, A., Morabia, K., Wang, J., Chang, K. C. C., & Schwing, A. (2022). CoVA: Context-aware Visual Attention for Webpage Information Extraction. In ECNLP 2022 - 5th Workshop on e-Commerce and NLP, Proceedings of the Workshop (pp. 80–90). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.ecnlp-1.11
