Judging a site by its content: learning the textual, structural, and visual features of malicious Web pages

  • Bannur S
  • Saul L
  • Savage S


The physical world is rife with cues that allow us to distinguish between safe and unsafe situations. By contrast, the Internet offers a much more ambiguous environment; hence many users are unable to distinguish a scam from a legitimate Web page. To help address this problem, we explore how to train classifiers that can automatically identify malicious Web pages based on clues from their textual content, structural tags, page links, visual appearance, and URLs. Using a contemporary labeled data feed from a large Web mail provider, we extract such features and demonstrate how they can be used to improve classification accuracy over previous, more constrained approaches. In particular, by analyzing the full content of individual Web pages, we more than halve the error rate obtained by a comparably trained classifier that only extracts features from URLs. By training classifiers on different sets of features, we are further able to assess the strength of clues provided by these different sources of information.
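To make the idea of training on different feature sets concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary bag-of-words features drawn from four of the sources the abstract names (URL, text, tags, links; visual appearance is omitted) and swaps in a plain logistic regression trained by stochastic gradient descent, since the abstract does not name the classifier. The names `extract_features`, `train_logistic`, `predict`, and the toy `PAGES` data are all hypothetical.

```python
import math
import re
from html.parser import HTMLParser


class _PageCollector(HTMLParser):
    """Collects tag names, link targets, and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.tags, self.links, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def tokenize(s):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def extract_features(url, html, sources=("url", "text", "tag", "link")):
    """Binary presence features, each prefixed by the source it came from."""
    page = _PageCollector()
    page.feed(html)
    page.close()
    tokens = {
        "url": tokenize(url),
        "text": tokenize(" ".join(page.text)),
        "tag": page.tags,
        "link": [t for href in page.links for t in tokenize(href)],
    }
    return {f"{src}:{tok}": 1.0 for src in sources for tok in tokens[src]}


def _score(weights, bias, feats):
    z = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, labels, epochs=200, lr=0.1):
    """Logistic regression via plain SGD (an assumed stand-in classifier)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, y in zip(examples, labels):
            g = _score(weights, bias, feats) - y  # gradient of the log loss
            bias -= lr * g
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * g * v
    return weights, bias


def predict(model, feats):
    """Estimated probability that a page is malicious."""
    weights, bias = model
    return _score(weights, bias, feats)


# Hypothetical toy data: (url, html, label) with 1 = malicious, 0 = benign.
PAGES = [
    ("http://example.com/login",
     "<html><body><form><input></form><p>verify your account password now</p>"
     "<a href='http://example.net/collect'>click</a></body></html>", 1),
    ("http://example.org/news",
     "<html><body><h1>Local news</h1><p>weather and sports today</p>"
     "<a href='http://example.org/about'>about</a></body></html>", 0),
    ("http://example.com/prize",
     "<p>you won a prize verify password</p><form><input></form>", 1),
    ("http://example.org/blog",
     "<h1>Blog</h1><p>notes on gardening today</p>", 0),
]
```

Passing `sources=("url",)` to `extract_features` gives a URL-only baseline, while the default uses every source; training one model per choice of `sources` and comparing held-out error is one way to gauge how much each source contributes. The toy data here is far too small to say anything about real accuracy.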


  • Sushma Nagesh Bannur

  • Lawrence K. Saul

  • Stefan Savage
