Large-scale web page classification

Abstract

This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In an imbalanced dataset, the prior probability distribution of a category indicates the presence or absence of class imbalance, which may take the form of a lack of positive training instances (rarity) or an overabundance of positive instances. We partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups based on the prior probability distribution. The best-performing classification methods for each prior probability distribution were identified and used to design a content-based classification model for the complete (as of 2007) Yahoo! web directory of 639,671 categories and 4,140,629 web pages. The methodology was validated using a DMOZ subset of 17,217 categories and 130,594 web pages, and we demonstrated statistically that it works equally well on large and small datasets. © 2014 IEEE.
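The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes each category's prior is estimated as its share of the labelled pages. The abstract does not state the boundaries of the five groups, so the thresholds in `GROUP_BOUNDS` and the function names below are hypothetical and not the authors' values or method.

```python
from collections import Counter

# Hypothetical upper bounds on the category prior for the first four groups.
# Priors above the last bound fall into the fifth group. These values are
# illustrative assumptions, not taken from the paper.
GROUP_BOUNDS = (0.001, 0.01, 0.1, 0.5)


def category_priors(labels):
    """Estimate P(category) as the category's share of all labelled pages."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def assign_group(prior, bounds=GROUP_BOUNDS):
    """Return a group index in 0..len(bounds); each prior maps to exactly one group."""
    for index, upper in enumerate(bounds):
        if prior <= upper:
            return index
    return len(bounds)


def partition_categories(labels, bounds=GROUP_BOUNDS):
    """Split categories into mutually exclusive groups by estimated prior."""
    priors = category_priors(labels)
    groups = {index: [] for index in range(len(bounds) + 1)}
    for category, prior in sorted(priors.items()):
        groups[assign_group(prior, bounds)].append(category)
    return groups
```

Under this sketch, a separate classifier could then be selected for each group, in the spirit of the per-distribution method selection the abstract describes. Which classifiers suit which group is not stated in the abstract and is left open here.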

Cite

Marath, S. T., Shepherd, M., Milios, E., & Duffy, J. (2014). Large-scale web page classification. In Proceedings of the Annual Hawaii International Conference on System Sciences (pp. 1813–1822). IEEE Computer Society. https://doi.org/10.1109/HICSS.2014.229
