Knowing a web page by the company it keeps

  • Qi X
  • Davison B

Abstract

Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers to web data often produces unsatisfactory results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help classify the page in question. In experiments studying the effect of these factors on our algorithm, we find that the proposed method boosts the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.
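
The idea of combining neighbor categories with a page's own textual prediction can be illustrated with a minimal sketch. This is not the authors' algorithm: the weighted-average scheme, the function names (`average_distribution`, `combine`, `classify`), the weights, and the sample distributions are all assumptions made for illustration. Each neighbor is represented by a class-probability dictionary, which could come from a human label or, for an unlabeled neighbor, from a textual classifier.

```python
from collections import defaultdict

# The four neighbor types named in the abstract.
NEIGHBOR_TYPES = ("parents", "children", "siblings", "spouses")


def average_distribution(distributions):
    """Average a list of class-probability dicts; an empty list gives {}."""
    if not distributions:
        return {}
    totals = defaultdict(float)
    for dist in distributions:
        for label, prob in dist.items():
            totals[label] += prob
    return {label: s / len(distributions) for label, s in totals.items()}


def combine(own, neighbors, type_weights, own_weight):
    """Blend a page's own distribution with its neighbors' (hypothetical scheme).

    own_weight is the share given to the page's own prediction; the rest is
    split across the neighbor types that are present, in proportion to
    type_weights.
    """
    per_type = {t: average_distribution(neighbors.get(t, [])) for t in NEIGHBOR_TYPES}
    per_type = {t: d for t, d in per_type.items() if d}
    total_weight = sum(type_weights[t] for t in per_type)
    if total_weight == 0:
        return dict(own)  # no usable neighbors: fall back to the text-only result
    scores = defaultdict(float)
    for label, prob in own.items():
        scores[label] += own_weight * prob
    for t, dist in per_type.items():
        share = (1.0 - own_weight) * type_weights[t] / total_weight
        for label, prob in dist.items():
            scores[label] += share * prob
    return dict(scores)


def classify(scores):
    """Return the label with the highest combined score."""
    return max(scores, key=scores.get)


# Illustrative inputs only; none of these numbers come from the paper.
own = {"Arts": 0.6, "Science": 0.4}
neighbors = {
    "parents": [{"Arts": 0.3, "Science": 0.7}],
    "siblings": [{"Arts": 0.1, "Science": 0.9}, {"Arts": 0.2, "Science": 0.8}],
}
# Assumed weights; siblings are weighted highest only to echo the abstract's finding.
type_weights = {"parents": 1.0, "children": 1.0, "siblings": 2.0, "spouses": 1.0}

combined = combine(own, neighbors, type_weights, own_weight=0.3)
print(classify(own), "->", classify(combined))
```

In this toy input the text-only prediction favors one class while the neighbors pull the combined result toward the other, which is the kind of correction the abstract describes.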

Author-supplied keywords

  • neighboring
  • rainbow
  • svm
  • web page classification


Authors

  • Xiaoguang Qi

  • Brian D. Davison
