Probabilistic description-oriented approach for categorizing Web documents

Abstract

The automatic categorization of web documents is becoming crucial for organizing the huge amount of information available on the Internet. It poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers. Our categorization approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) an enhanced representation of web documents is crucial for effective categorization, and (2) the probabilistic interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
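To make the classifier side of the abstract concrete, the following is a minimal sketch of a similarity-weighted k-nearest neighbour categorizer whose votes are normalized into probability-like scores. It is not the formulation from the paper, which the abstract does not spell out. The sparse term-weight dictionaries, the cosine similarity, the function names `cosine` and `knn_category_scores`, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed representation)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_category_scores(query, training, k=3):
    """Score categories by the similarity-weighted votes of the k nearest documents.

    training is a list of (term_weights, category) pairs. The scores are
    normalized to sum to 1, so they can be read as rough estimates of
    category membership. This is a generic weighting, not the paper's.
    """
    neighbours = sorted(
        ((cosine(query, doc), cat) for doc, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0.0:
        return {}  # no neighbour shares a term with the query
    scores = defaultdict(float)
    for sim, cat in neighbours:
        scores[cat] += sim / total
    return dict(scores)


# Hypothetical toy collection: term weights are made-up counts.
training = [
    ({"football": 2, "goal": 1}, "sport"),
    ({"match": 1, "goal": 2}, "sport"),
    ({"election": 2, "vote": 1}, "politics"),
    ({"vote": 2, "party": 1}, "politics"),
]
query = {"goal": 1, "match": 1}
scores = knn_category_scores(query, training, k=3)
best = max(scores, key=scores.get)
```

In this sketch `k` and the similarity function are the parameters a practitioner would otherwise tune by hand; the abstract's claim is that a probabilistic reading of the classifier gives such parameters a principled justification.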

Cite

Goevert, N., Lalmas, M., & Fuhr, N. (1999). Probabilistic description-oriented approach for categorizing Web documents. In International Conference on Information and Knowledge Management, Proceedings (pp. 475–482). ACM. https://doi.org/10.1145/319950.320053
