Scalability of the Nutch search engine

  • Moreira J
  • Michael M
  • Da Silva D
 et al. 
  • 39


    Mendeley users who have this article in their library.
  • 18


    Citations of this article.


Nutch is an open source search engine that is gaining increasing popularity in the commercial world. The Nutch architecture leads itself to a wide range of parallelization techniques. Multiple backend servers can be used to both partition the corpus of search data, thus increasing the rate of queries serviced, and to increase the size of the search data while preserving the service rate. Alternatively, multiple search engines can operate in parallel, further increasing the query rate. In this paper, we analyze the performance and scalability of various configurations of Nutch. The configurations were implemented as part of the Commercial Scale Out project at IBM Research, and were used to investigate the applicability of scale-out architectures in commercial environments. We conclude that Nutch is highly scalable, with the different configurations behaving differently from a performance perspective.

Get free article suggestions today

Mendeley saves you time finding and organizing research

Sign up here
Already have an account ?Sign in

Find this document


  • José E. Moreira

  • Maged M. Michael

  • Dilma Da Silva

  • Doron Shiloach

  • Parijat Dube

  • Li Zhang

Cite this document

Choose a citation style from the tabs below

Save time finding and organizing research with Mendeley

Sign up for free