A Dynamic Approach for Template and Content Extraction in Websites


Abstract

Web scraping is a technique for extracting data from websites, and it underpins information retrieval on an ever-growing World Wide Web. There are two main ways of extracting data from a website: static and dynamic scraping. Static scraping requires input beyond the target website, because the user must inspect the HTML of the target and find patterns in its templates that are then used to extract data. Static scraping is also very vulnerable to changes in the template of the web page. Dynamic scraping is a broad topic that has been tackled from many angles: tree-based, natural language processing (NLP), computer vision, and machine learning techniques. For most websites, the problem can be broken into two main steps: finding the template of the pages we want to extract data from, and then removing irrelevant text such as ads, text from controls, or JavaScript code. This paper proposes a solution for dynamic scraping that uses AngleSharp for HTML retrieval and a slightly modified version of a previously published graph technique for template finding. Once a number of pages have been found, several heuristics can be applied for content extraction and noise filtering. These heuristics include text and hyperlink density, removal of content common to multiple pages (usually text from controls or static JavaScript), and a final layer of NLP techniques (breaking the content into sentences, tokenization, and part-of-speech tagging).
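
Two of the heuristics named in the abstract, hyperlink density and removal of content shared between pages, can be illustrated with a minimal sketch. This is not the paper's implementation: the paper uses AngleSharp (a C# library), whereas the sketch below uses only Python's standard `html.parser`. The names `DensityParser`, `link_density`, and `strip_common_lines`, and the `min_share` parameter, are hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class DensityParser(HTMLParser):
    """Collects visible text fragments and counts characters inside <a> tags."""

    SKIP = {"script", "style"}  # content of these tags is never treated as text

    def __init__(self):
        super().__init__()
        self.text_chars = 0   # characters of visible text
        self.link_chars = 0   # characters of visible text inside links
        self.lines = []       # visible text fragments, in document order
        self._in_link = 0
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        self.text_chars += len(text)
        if self._in_link:
            self.link_chars += len(text)
        self.lines.append(text)


def _parse(html):
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    return parser


def link_density(html):
    """Fraction of visible text that is link text (0.0 for empty input)."""
    parser = _parse(html)
    return parser.link_chars / parser.text_chars if parser.text_chars else 0.0


def strip_common_lines(pages, min_share=1.0):
    """Drop text fragments that occur on at least min_share of the pages.

    Assumption: fragments repeated across pages of one template are
    navigation or control text rather than content.
    """
    per_page = [_parse(html).lines for html in pages]
    if len(pages) < 2:
        return per_page  # nothing to compare against
    counts = Counter(line for lines in per_page for line in set(lines))
    threshold = min_share * len(pages)
    return [[l for l in lines if counts[l] < threshold] for lines in per_page]
```

A block with a link density close to 1 is likely a menu or a list of links, while a block close to 0 is more likely article text. The cut-off between the two is a tuning choice that this sketch does not fix.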

Cristian-Catalin, N., & Dragan, M. (2020). A Dynamic Approach for Template and Content Extraction in Websites. In Advances in Intelligent Systems and Computing (Vol. 1159 AISC, pp. 15–20). Springer. https://doi.org/10.1007/978-3-030-45688-7_2
