Filtering contents with bigrams and named entities to improve text classification

François Paradis; Jian Yun Nie

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2005) 3689 LNCS 135-146

DOI: 10.1007/11562382_11

1 Citation

7 Readers

Abstract

We present a new method for the classification of "noisy" documents, based on filtering contents with bigrams and named entities. The method is applied to call for tender documents, but we claim it would be useful for many other Web collections, which also contain non-topical contents. Different variations of the method are discussed. We obtain the best results by filtering out a window around the least relevant bigrams. We find a significant increase of the micro-F1 measure on our collection of call for tenders, as well as on the "4-Universities" collection. Another approach, to reject sentences based on the presence of some named entities, also shows a moderate increase. Finally, we try combining the two approaches, but do not get conclusive results so far. © Springer-Verlag Berlin Heidelberg 2005.
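The bigram window filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how bigram relevance is scored, how "least relevant" is decided, or how wide the window is, so the score table `relevance`, the cutoff `threshold`, the `window` size, and the function name `filter_low_relevance_windows` are all hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def filter_low_relevance_windows(
    tokens: List[str],
    relevance: Dict[Bigram, float],
    threshold: float = 0.0,
    window: int = 2,
) -> List[str]:
    """Drop every token within `window` positions of a low-scoring bigram.

    `relevance` is an assumed, precomputed score table; how such scores
    would be estimated is not specified in the abstract.
    """
    drop = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        score = relevance.get((tokens[i], tokens[i + 1]))
        # Assumption: bigrams missing from the table are neutral and kept.
        if score is not None and score < threshold:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)
            for j in range(start, end):
                drop[j] = True
    return [tok for tok, dropped in zip(tokens, drop) if not dropped]


if __name__ == "__main__":
    text = ("please submit bids before closing date "
            "supply of office furniture and desks")
    scores = {("closing", "date"): -1.5, ("office", "furniture"): 2.0}
    print(filter_low_relevance_windows(text.split(), scores, window=1))
    # → ['please', 'submit', 'bids', 'of', 'office', 'furniture', 'and', 'desks']
```

Under these assumptions, the filtered token list would then be passed to the text classifier in place of the full document.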

Citation (APA)

Paradis, F., & Nie, J. Y. (2005). Filtering contents with bigrams and named entities to improve text classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3689 LNCS, pp. 135–146). https://doi.org/10.1007/11562382_11
