Large-scale web page classification

Sathi T. Marath; Michael Shepherd; Evangelos Milios; Jack Duffy

Conference Proceedings, Open Access

Proceedings of the Annual Hawaii International Conference on System Sciences (2014) 1813-1822

DOI: 10.1109/HICSS.2014.229

Abstract

This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In an imbalanced dataset, the prior probability distribution of a category indicates the presence or absence of class imbalance. Imbalance may take the form of a lack of positive training instances (rarity) or an overabundance of positive instances. We partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups based on the prior probability distribution. The best-performing classification methods for each group were identified and used to design a content-based classification model for the complete (as of 2007) Yahoo! web directory of 639,671 categories and 4,140,629 web pages. The methodology was validated on a DMOZ subset of 17,217 categories and 130,594 web pages, and we demonstrated statistically that it works equally well on large and small datasets. © 2014 IEEE.
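To make the partitioning step concrete, here is a minimal sketch of grouping categories into five mutually exclusive bins by their prior probability. The abstract does not give the group boundaries or names, so `ASSUMED_EDGES`, `GROUP_NAMES`, and all function names below are hypothetical and chosen only for illustration. They are not the authors' values or implementation.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points on the category prior. The abstract does not
# state the actual boundaries used in the paper.
ASSUMED_EDGES = (0.01, 0.05, 0.20, 0.50)
# Hypothetical labels for the five resulting groups (index 0..4).
GROUP_NAMES = ("very rare", "rare", "moderate", "frequent", "dominant")


def category_priors(labels):
    """Return {category: fraction of pages carrying that category label}."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: n / total for cat, n in counts.items()}


def assign_group(prior, edges=ASSUMED_EDGES):
    """Map a prior to exactly one group index in 0..len(edges)."""
    return bisect_right(edges, prior)


def partition_categories(labels, edges=ASSUMED_EDGES):
    """Collect categories under the index of the bin their prior falls into."""
    groups = {i: [] for i in range(len(edges) + 1)}
    for cat, prior in category_priors(labels).items():
        groups[assign_group(prior, edges)].append(cat)
    return groups


if __name__ == "__main__":
    # Toy page labels for sibling subcategories under one parent category.
    toy = ["a"] * 600 + ["b"] * 300 + ["c"] * 95 + ["d"] * 5
    for idx, cats in partition_categories(toy).items():
        print(GROUP_NAMES[idx], cats)
```

Under this sketch, each group index could then be paired with whichever classifier performs best on held-out data for that range of priors, which is the selection idea the abstract describes.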

Marath, S. T., Shepherd, M., Milios, E., & Duffy, J. (2014). Large-scale web page classification. In Proceedings of the Annual Hawaii International Conference on System Sciences (pp. 1813–1822). IEEE Computer Society. https://doi.org/10.1109/HICSS.2014.229
