An FW-DTSS based approach for news page information extraction

Leiming Ma; Zhengyou Xia

Journal Article


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016) 9714 LNCS 227-234

DOI: 10.1007/978-3-319-40973-3_22

3 Citations

3 Readers


Abstract

Automatically identifying and extracting the main text from a news page has become a critical task in many web content analysis applications as the volume of online news has grown explosively. However, the body content is usually surrounded by presentation elements, such as dynamic flashing logos, navigational menus and a multitude of ad blocks. In this paper, we propose a function word (FW) based approach that incorporates the concept of DOM tree structure similarity (DTSS). Function words are words that carry little lexical meaning of their own and mainly serve a grammatical or functional role. Statistics from our experiments show that function words occur frequently in main text, whereas in presentation elements they are absent or appear only once or twice. Our approach involves three separate stages. Stage 1 is a learning stage. In stage 2, the number of function words in each paragraph is counted, and the paragraph with the most function words is chosen as the sample. In stage 3, all body paragraphs are extracted according to their similarity to the sample paragraph in DOM tree structure. Experimental results on real-world data show that the FW-DTSS based approach performs well in both efficiency and accuracy compared with statistics-based and vision-based approaches.
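Stages 2 and 3 as described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define the DTSS measure or what stage 1 learns, so the sketch assumes a small hard-coded English function-word list (`FUNCTION_WORDS`) in place of stage 1, and a simple tag-path comparison (`path_similarity`) as a stand-in for DTSS. All names, the threshold and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: stands in for whatever stage 1 (learning) produces.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
                  "but", "is", "was", "that", "with", "for"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockCollector(HTMLParser):
    """Records (tag path from root, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.blocks = []  # list of (path tuple, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.blocks.append((tuple(self.stack), text))


def count_function_words(text):
    """Stage 2: count function-word occurrences in one text block."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in FUNCTION_WORDS for w in words)


def path_similarity(p, q):
    """Assumed stand-in for DTSS: share of aligned path positions that match."""
    matches = sum(a == b for a, b in zip(p, q))
    return matches / max(len(p), len(q))


def extract_main_text(html, threshold=1.0):
    """Pick the block richest in function words, then keep similar blocks."""
    parser = BlockCollector()
    parser.feed(html)
    if not parser.blocks:
        return []
    sample_path, _ = max(parser.blocks,
                         key=lambda b: count_function_words(b[1]))
    # Stage 3: keep blocks whose DOM path resembles the sample's path.
    return [text for path, text in parser.blocks
            if path_similarity(path, sample_path) >= threshold]


# Hypothetical page: a menu, two body paragraphs and an ad block.
PAGE = """
<html><body>
<div><ul><li>Home</li><li>Sports</li></ul></div>
<div><article>
<p>The mayor said that the city is ready for the storm.</p>
<p>Crews worked on the roads and a shelter was opened in the school.</p>
</article></div>
<div><span>Buy now</span></div>
</body></html>
"""

main_text = extract_main_text(PAGE)
```

On this sample page, the menu items and the ad contain no function words, so the second body paragraph becomes the sample and only the two `<p>` blocks that share its tag path are returned. The sketch treats each text node as a block, so paragraphs containing inline markup such as links would need extra handling.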


Cite (APA)

Ma, L., & Xia, Z. (2016). An FW-DTSS based approach for news page information extraction. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9714 LNCS, 227–234. https://doi.org/10.1007/978-3-319-40973-3_22
