The automatic categorization of web documents is becoming crucial for organizing the huge amount of information available on the Internet. This poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers. Our categorization approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) an enhanced representation of web documents is crucial for effective categorization, and (2) the probabilistic interpretation of the k-nearest neighbour classifier improves on the standard k-nearest neighbour classifier.
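For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a standard k-nearest neighbour text categorizer with similarity-weighted voting. It is not the probabilistic interpretation proposed in the paper, and it does not use the description-oriented representation. The bag-of-words features, the cosine similarity, the normalization of votes, and the names `cosine` and `knn_scores` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(query, training, k=3):
    """Score categories for `query` from its k most similar training documents.

    `training` is a list of (text, category) pairs. Each of the k neighbours
    votes for its category with a weight equal to its similarity to the query
    (an illustrative choice, not the paper's weighting). Scores are normalized
    to sum to 1 when at least one neighbour has non-zero similarity.
    """
    q = Counter(query.lower().split())
    ranked = sorted(
        ((cosine(q, Counter(text.lower().split())), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, label in ranked:
        scores[label] += sim
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else {}


if __name__ == "__main__":
    # Toy data, invented for this sketch only.
    docs = [
        ("football match goal score", "sports"),
        ("tennis match player score", "sports"),
        ("stock market shares price", "finance"),
        ("bank interest rate market", "finance"),
    ]
    print(knn_scores("football match score", docs, k=3))
```

In this sketch the number of neighbours `k` and the vote weights are fixed by hand; the abstract's point is that a probabilistic reading of the classifier gives a principled justification for such parameters.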
CITATION STYLE
Goevert, N., Lalmas, M., & Fuhr, N. (1999). Probabilistic description-oriented approach for categorizing Web documents. In International Conference on Information and Knowledge Management, Proceedings (pp. 475–482). ACM. https://doi.org/10.1145/319950.320053