Revisiting web data extraction using in-browser structural analysis and visual cues in modern web designs

Alfonso Murolo; Moira C. Norrie

Conference Proceedings (Open Access)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9671, pp. 114-131 (2016)

DOI: 10.1007/978-3-319-38791-8_7

Abstract

Recent trends in website design affect the methods used for web data extraction. Many existing methods rely on structural analysis of web pages, but with the introduction of CSS, table-based layouts are no longer used. Responsive design, meanwhile, makes layout and presentation depend on the browsing context, which also complicates the use of visual cues. We present DeepDesign, a system that semi-automatically extracts data records from web pages based on a combination of structural and visual features. It runs in a general-purpose browser, taking advantage of direct access to the complete CSS3 spectrum and the capability to trigger and execute JavaScript in the page. The user sees record matching in real time and can dynamically adapt the process if required. We present the details of the matching algorithms and evaluate them on the top ten Alexa websites.

Citation (APA)

Murolo, A., & Norrie, M. C. (2016). Revisiting web data extraction using in-browser structural analysis and visual cues in modern web designs. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9671, pp. 114–131). Springer Verlag. https://doi.org/10.1007/978-3-319-38791-8_7
