Large-scale web page classification

Abstract

This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In such a dataset, the prior probability distribution of a category indicates whether class imbalance is present. Imbalance may take the form of a lack of positive training instances (rarity) or an overabundance of positive instances. We partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups based on the prior probability distribution. The best-performing classification methods for each prior probability distribution were identified and used to design a content-based classification model for the complete (as of 2007) Yahoo! web directory of 639,671 categories and 4,140,629 web pages. The methodology was validated using a DMOZ subset of 17,217 categories and 130,594 web pages, and we demonstrated statistically that it works equally well on large and small datasets. © 2014 IEEE.
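The partitioning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the group boundaries or group names, so the cut points in `THRESHOLDS`, the labels in `GROUPS`, and the definition of a prior as a category's page count divided by the total page count are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical cut points on the prior probability; the abstract does not
# state the actual boundaries used to form the five groups.
THRESHOLDS = [0.01, 0.10, 0.50, 0.90]

# Hypothetical group labels, ordered from lowest to highest prior.
GROUPS = ["very rare", "rare", "balanced", "abundant", "very abundant"]


def prior(positive_pages, total_pages):
    """Assumed prior: share of all pages that belong to the category."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return positive_pages / total_pages


def assign_group(p):
    """Map a prior probability to exactly one of the five groups."""
    return GROUPS[bisect_right(THRESHOLDS, p)]


def partition(counts, total_pages):
    """Split categories into mutually exclusive groups by prior probability."""
    groups = {name: [] for name in GROUPS}
    for category, n in counts.items():
        groups[assign_group(prior(n, total_pages))].append(category)
    return groups
```

Because each prior falls into exactly one interval, every category lands in exactly one group, which mirrors the mutually exclusive partition the abstract describes. A separate classifier could then be chosen per group.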

Marath, S. T., Shepherd, M., Milios, E., & Duffy, J. (2014). Large-scale web page classification. In Proceedings of the Annual Hawaii International Conference on System Sciences (pp. 1813–1822). IEEE Computer Society. https://doi.org/10.1109/HICSS.2014.229
