Boilerplate detection and recoding


Abstract

Many information access applications must handle natural language texts that contain a large proportion of repeated and mostly invariable patterns, called boilerplates, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this. We propose a method that detects such motifs automatically and without supervision, and enriches the document representation with features specific to these motifs. We show experimentally that this document recoding strategy improves classification on different collections. © 2014 Springer International Publishing Switzerland.
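
To make the idea of detection followed by recoding concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not describe the detection algorithm, so this sketch substitutes a much simpler heuristic, treating any fixed-length word n-gram that occurs in at least `min_docs` documents as a motif. The names `detect_motifs` and `recode`, the `MOTIF::` feature prefix, and the sample documents are all hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the set of distinct word n-grams (as tuples) in one token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def detect_motifs(docs, n=4, min_docs=2):
    """Unsupervised stand-in for motif detection (an assumption, not the
    paper's algorithm): keep n-grams found in at least `min_docs` documents."""
    doc_freq = Counter()
    for text in docs:
        # Each document contributes at most once per n-gram (document frequency).
        doc_freq.update(word_ngrams(text.lower().split(), n))
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


def recode(text, motifs, n=4):
    """Bag of words enriched with one extra feature per motif present in the text."""
    tokens = text.lower().split()
    features = Counter(tokens)
    for gram in word_ngrams(tokens, n):
        if gram in motifs:
            features["MOTIF::" + " ".join(gram)] += 1
    return features


# Hypothetical toy collection: two messages share a confidentiality footer.
docs = [
    "this message is confidential please do not forward budget attached",
    "meeting moved to friday this message is confidential please do not forward",
    "lunch on friday works for me",
]
motifs = detect_motifs(docs)
features = recode(docs[0], motifs)
```

In this toy collection the shared footer yields several overlapping motifs, each of which becomes its own feature alongside the ordinary word counts. A real system would likely merge overlapping n-grams into one longer motif, which this sketch does not attempt.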

Cite

Gallé, M., & Renders, J. M. (2014). Boilerplate detection and recoding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8416 LNCS, pp. 462–467). Springer Verlag. https://doi.org/10.1007/978-3-319-06028-6_42
