Extracting the Main Content of Web Pages Using the First Impression Area


Abstract

Extracting the main content from a web page is essential in various applications such as web crawlers and browser reader modes. Existing extraction methods that rely on text-based algorithms and features designed for English text can be ineffective for non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), the part of a web page that users initially view. In this area, websites apply many techniques that help users find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual conditions. We evaluated our method, the reader modes of two browsers (Mozilla Firefox and Google Chrome), and existing main content extraction methods on multilingual datasets using two measures: longest common subsequence (LCS) and matched text blocks. The results showed that our method performed better than the other methods on both English (up to 46%, matched text blocks $F_{0.5}$) and non-English (up to 42%, matched text blocks $F_{0.5}$) web pages.
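The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the abstract does not say how text blocks are matched or how LCS is normalized, so exact string equality between blocks and a plain character-level LCS length are assumptions here, and the function names `f_beta` and `lcs_length` are hypothetical.

```python
def f_beta(extracted_blocks, reference_blocks, beta=0.5):
    """F-beta score over matched text blocks.

    Assumption: two blocks "match" when their strings are identical.
    With beta = 0.5, precision is weighted more heavily than recall.
    """
    extracted = set(extracted_blocks)
    reference = set(reference_blocks)
    matched = len(extracted & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(extracted)
    recall = matched / len(reference)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    computed with the standard dynamic-programming recurrence
    using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical block labels, for illustration only.
    extracted = ["title", "para1", "ad", "footer"]
    reference = ["title", "para1", "para2"]
    print(round(f_beta(extracted, reference), 4))
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Because $F_{0.5}$ favors precision, an extractor that pulls in extra non-content blocks (ads, footers) is penalized more than one that misses a few content blocks.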


Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access, 10, 129958–129969. https://doi.org/10.1109/ACCESS.2022.3229080
