Pattern Matching-based scraping of news websites

Hamza Salem; Manuel Mazzara

Conference Proceedings, Open Access

Journal of Physics: Conference Series (2020) 1694(1)

DOI: 10.1088/1742-6596/1694/1/012011

9 Citations

26 Readers

Abstract

Web scraping is the process of extracting content from human-readable websites in order to import it into local storage such as databases or CSV files. Designing the data extraction process is time-consuming: it requires analyzing the website, the object representation of its structure (the DOM), its HTML tags, and its Cascading Style Sheets (CSS) classes. We aim to support this process through automation. In this paper, we propose a pattern mining technique to scrape news and blog websites by recognizing the title and body based on a content structure pattern. The approach consists of three steps: extracting the news website structure, constructing a pattern of HTML content, and implementing the pattern as a set of rules in web scraping. Our approach is a simple, general, and straightforward way to extract articles, each consisting of a title and a body, from any blog or news website.
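The abstract does not spell out the pattern itself, so the following is only a minimal sketch of what "implementing the pattern as a set of rules" might look like. It assumes a hypothetical pattern in which the first `<h1>` is the title and every `<p>` that follows it belongs to the body; the class name `ArticlePatternParser` and the function `extract_article` are illustrative and do not come from the paper.

```python
from html.parser import HTMLParser


class ArticlePatternParser(HTMLParser):
    """Assumed pattern: first <h1> is the title, later <p> tags form the body."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.paragraphs = []
        self._current_tag = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Rule 1 (assumption): the first <h1> on the page opens the title.
        if tag == "h1" and self.title is None:
            self._current_tag, self._buffer = "h1", []
        # Rule 2 (assumption): any <p> seen after the title is body text.
        elif tag == "p" and self.title is not None:
            self._current_tag, self._buffer = "p", []

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <a>, ...) is collected too.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        # Collapse runs of whitespace left over from the markup.
        text = " ".join("".join(self._buffer).split())
        if tag == "h1":
            self.title = text
        elif text:
            self.paragraphs.append(text)
        self._current_tag = None


def extract_article(html):
    """Apply the assumed title/body rules to one HTML document."""
    parser = ArticlePatternParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": "\n".join(parser.paragraphs)}
```

Paragraphs that appear before the title, such as navigation text, are skipped under this rule set. A real news site would likely need rules derived from its own structure, which is the step the paper aims to automate.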

Cite (APA)

Salem, H., & Mazzara, M. (2020). Pattern Matching-based scraping of news websites. In Journal of Physics: Conference Series (Vol. 1694). IOP Publishing Ltd. https://doi.org/10.1088/1742-6596/1694/1/012011
