Hybrid method for automated news content extraction from the Web

Abstract

Web news content extraction is vital for improving news indexing and search in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implement a prototype system for Web news content extraction and evaluate it empirically; the results show that our method is highly effective and efficient. © Springer-Verlag Berlin Heidelberg 2006.
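
The abstract does not spell out how TSReC or its extraction algorithm works, so the following is only a rough illustration of the general idea of tag sequence matching, not the authors' method. It assumes that two pages from the same site share a template, flattens each page into a sequence of tag and text tokens, and keeps the text tokens that fail to align between the two. The names `TagSequencer`, `tag_sequence`, and `varying_text` are hypothetical, and the tree matching half of the hybrid is not covered here.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequencer(HTMLParser):
    """Flatten an HTML page into (kind, value) tokens.

    Hypothetical helper for illustration; this is not the TSReC
    representation described in the paper.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored so that template tags align across pages.
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tag_sequence(html):
    """Return the token sequence for one HTML document."""
    parser = TagSequencer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def varying_text(page, sibling):
    """Return text tokens of `page` that do not align with `sibling`.

    Assumption: both pages come from the same template, so aligned
    tokens are template material and unaligned text is page content.
    """
    a, b = tag_sequence(page), tag_sequence(sibling)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    content = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            content.extend(v for kind, v in a[i1:i2] if kind == "text")
    return content


if __name__ == "__main__":
    page_a = ("<html><body><div>Site menu</div><div><h1>Storm hits coast</h1>"
              "<p>Heavy rain fell overnight.</p></div><div>Copyright</div>"
              "</body></html>")
    page_b = ("<html><body><div>Site menu</div><div><h1>Markets rally</h1>"
              "<p>Stocks rose on Monday.</p></div><div>Copyright</div>"
              "</body></html>")
    print(varying_text(page_a, page_b))
```

In this sketch the shared menu and footer align and are discarded, while the headline and body text differ between the two pages and are kept. Real news pages would need more than a single sibling page and a flat alignment, which is presumably where a tree-aware representation becomes useful.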

Cite

Li, Y., Meng, X., Li, Q., & Wang, L. (2006). Hybrid method for automated news content extraction from the Web. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4255 LNCS, pp. 327–338). Springer Verlag. https://doi.org/10.1007/11912873_34
