This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In such a dataset, the prior probability distribution of a category indicates whether class imbalance is present. Imbalance may take the form of a lack of positive training instances (rarity) or an overabundance of positive instances. We partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups based on their prior probability distributions. For each group, the best-performing classification methods were identified and used to design a content-based classification model for the complete (as of 2007) Yahoo! web directory of 639,671 categories and 4,140,629 web pages. The methodology was validated on a DMOZ subset of 17,217 categories and 130,594 web pages, and we demonstrated statistically that it works equally well on large and small datasets. © 2014 IEEE.
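The partitioning step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the group boundaries nor the exact definition of the prior, so the cut points in `HYPOTHETICAL_EDGES`, the function names, and the choice to compute a category's prior as its share of pages in a flat label list are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points on the category prior. The abstract does not
# state the real thresholds; four edges yield five groups (0..4).
HYPOTHETICAL_EDGES = [0.001, 0.01, 0.1, 0.5]


def category_priors(page_labels):
    """Estimate each category's prior as its share of all labeled pages."""
    counts = Counter(page_labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def assign_group(prior, edges=HYPOTHETICAL_EDGES):
    """Map a prior to a group index: 0 is the rarest group, len(edges) the most abundant."""
    return bisect_right(edges, prior)


def partition_by_prior(page_labels, edges=HYPOTHETICAL_EDGES):
    """Place every category in exactly one group, so the groups are mutually exclusive."""
    groups = {g: [] for g in range(len(edges) + 1)}
    for cat, prior in category_priors(page_labels).items():
        groups[assign_group(prior, edges)].append(cat)
    return groups


if __name__ == "__main__":
    # Toy labels: one category label per page.
    labels = ["a"] * 12 + ["b"] * 7 + ["c"] * 1
    print(partition_by_prior(labels))
```

Under this sketch, a different classification method could then be selected per group index, which mirrors the idea of matching methods to prior probability distributions without claiming which methods the paper chose.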
Marath, S. T., Shepherd, M., Milios, E., & Duffy, J. (2014). Large-scale web page classification. In Proceedings of the Annual Hawaii International Conference on System Sciences (pp. 1813–1822). IEEE Computer Society. https://doi.org/10.1109/HICSS.2014.229