Many information access applications must handle natural language texts that contain a large proportion of repeated, mostly invariable patterns, called boilerplates, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically span one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this. We propose a method that detects such motifs automatically and without supervision, and enriches the document representation with features specific to these motifs. We show experimentally that this document recoding strategy improves classification on different collections. © 2014 Springer International Publishing Switzerland.
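As a rough illustration of the idea of detecting repeated motifs and recoding documents with motif-specific features, the sketch below treats any fixed-length word n-gram that appears in at least `min_docs` documents as a motif and adds one extra feature per motif occurrence to a bag-of-words representation. This is a deliberate simplification and an assumption made for illustration, not the detection procedure of the paper; the function names `find_repeated_ngrams` and `recode`, the `MOTIF:` feature prefix, and the parameters `n` and `min_docs` are all hypothetical.

```python
from collections import Counter


def find_repeated_ngrams(docs, n=5, min_docs=2):
    """Return word n-grams that occur in at least `min_docs` documents.

    Simplifying assumption: a motif is a fixed-length n-gram shared
    across documents (hypothetical stand-in for real motif detection).
    """
    doc_freq = Counter()
    for text in docs:
        tokens = text.lower().split()
        # Use a set so that each n-gram counts once per document.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    return {gram for gram, count in doc_freq.items() if count >= min_docs}


def recode(text, motifs, n=5):
    """Bag-of-words counts plus one extra feature per detected motif."""
    tokens = text.lower().split()
    features = Counter(tokens)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in motifs:
            # Hypothetical feature naming scheme for motif features.
            features["MOTIF:" + " ".join(gram)] += 1
    return features


docs = [
    "this email is confidential please delete meeting at noon",
    "this email is confidential please delete budget attached",
    "lunch tomorrow",
]
motifs = find_repeated_ngrams(docs, n=4, min_docs=2)
features = recode(docs[0], motifs, n=4)
```

The resulting `features` mapping can be passed to any classifier that accepts sparse count features; the original word counts are kept, so the motif features are added to the representation rather than replacing it.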
Gallé, M., & Renders, J. M. (2014). Boilerplate detection and recoding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8416 LNCS, pp. 462–467). Springer Verlag. https://doi.org/10.1007/978-3-319-06028-6_42