Boilerplate detection and recoding

Matthias Gallé; Jean Michel Renders

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014) 8416 LNCS 462-467

DOI: 10.1007/978-3-319-06028-6_42

Abstract

Many information access applications have to tackle natural language texts that contain a large proportion of repeated and mostly invariable patterns, called boilerplates, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this phenomenon. We propose a method that detects such motifs automatically and without supervision, and that enriches the document representation with features specific to these motifs. We show experimentally that this document recoding strategy improves classification on different collections. © 2014 Springer International Publishing Switzerland.
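The abstract does not specify how motifs are detected, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a simple stand-in: word n-grams shared by several documents are treated as motifs, and each document's bag-of-words is extended with one extra feature per motif occurrence. The function names, the `MOTIF::` feature prefix, and the parameters `n` and `min_docs` are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous word n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def detect_motifs(docs, n=4, min_docs=2):
    """Stand-in detector: n-grams that occur in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each n-gram counts once per document.
        doc_freq.update(set(ngrams(doc.lower().split(), n)))
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


def recode(doc, motifs, n=4):
    """Bag-of-words counts plus one extra feature per motif occurrence."""
    tokens = doc.lower().split()
    features = Counter(tokens)
    for gram in ngrams(tokens, n):
        if gram in motifs:
            features["MOTIF::" + " ".join(gram)] += 1
    return features


# Toy collection (invented for illustration): two messages share a disclaimer.
docs = [
    "this message is confidential please delete it meeting moved to friday",
    "this message is confidential please delete it invoice attached",
    "lunch on friday",
]
motifs = detect_motifs(docs)
features = recode(docs[0], motifs)
```

The recoded feature dictionaries could then be passed to any standard text classifier. Whether the original word features are kept alongside the motif features, as they are here, is a design choice of this sketch.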

APA

Gallé, M., & Renders, J. M. (2014). Boilerplate detection and recoding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8416 LNCS, pp. 462–467). Springer Verlag. https://doi.org/10.1007/978-3-319-06028-6_42
