Host-IP Clustering Technique for Deep Web Characterization

Denis Shestakov
Department of Media Technology, Helsinki University of Technology
PL 5400, Finland-02015 TKK
denis.shestakov@tkk.fi

Tapio Salakoski
Department of Information Technology, University of Turku
Finland-20014 Turun yliopisto

Abstract: A huge portion of today's Web consists of web pages filled with information from myriads of online databases. This part of the Web, known as the deep Web, is to date relatively unexplored, and even major characteristics such as the number of searchable databases on the Web are disputed. In this paper, we aim at a more accurate estimation of the main parameters of the deep Web by sampling one national web domain. We propose the Host-IP clustering sampling technique, which addresses drawbacks of existing approaches to characterizing the deep Web, and report our findings based on a survey of the Russian Web conducted in September 2006. The obtained estimates, together with the proposed sampling method, could be useful for further studies on handling data in the deep Web.

Keywords: deep web; web characterization; virtual hosting; host-IP clustering sampling; search interface discovery

I. INTRODUCTION

Dynamic pages generated from parameters that a user provides via web search forms are poorly indexed by major web searchers and, hence, scarcely present in searchers' results. Such search interfaces provide web users with online access to myriads of databases, whose contents comprise a huge part of the Web known as the deep Web [1]. Though the term deep Web was coined in 2000 [2], which is sufficiently long ago for any web-related concept or technology, many important characteristics of the deep Web still remain unknown. For example, the total number of searchable databases on the Web is highly disputable.
In fact, until now only three works (namely, [2], [3], [4]) have been solely devoted to deep web characterization and, moreover, one of them is a white paper whose findings were all obtained using proprietary methods. Another matter of concern is that the mentioned surveys are based on approaches with inherent limitations. The most serious drawback is ignoring so-called virtual hosting, i.e., the fact that multiple web sites can share the same IP address. Neglecting the virtual hosting factor means that the estimates produced by existing deep web surveys are highly biased.

In this work, our goal is to propose a better technique for deep web characterization. Our approach is based on the idea of clustering hosts that share the same IPs and analyzing "neighbors by IP" hosts together. Using host-IP mapping data allows us to address the drawbacks of previous surveys, specifically to take the virtual hosting factor into account.

The next section gives background on methods to characterize the deep Web. In Section III we present our approach, the Host-IP cluster sampling technique. The experiments and results of our survey of the Russian Web are described in Section IV. Finally, Section V concludes the paper.

II. BACKGROUND: DEEP WEB CHARACTERIZATION

Existing attempts to characterize the deep Web [2], [3], [4] are based on two methods originally applied to general Web surveys, namely overlap analysis [5] and random sampling of IP addresses [6]. The first technique involves pairwise comparisons of listings of deep web sites, where the overlap between each pair of sources is used to estimate the size of the deep Web (specifically, the total number of deep web sites) [2]. The critical requirement that the listings be independent of one another is infeasible in practice, which makes the estimates produced by overlap analysis seriously biased. Additionally, the method is generally non-reproducible.
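
To make the pairwise comparison concrete, the sketch below computes an overlap-based size estimate for every pair of listings. The text above does not state the exact formula used in [2]; the code assumes the standard capture-recapture form, N ≈ |A|·|B| / |A ∩ B|. The function names and the toy listings are hypothetical and serve for illustration only.

```python
from itertools import combinations


def overlap_estimate(listing_a, listing_b):
    """Estimate the population size from two listings.

    Assumes the capture-recapture form N ~ |A| * |B| / |A & B|,
    which is valid only if the two listings are independent samples.
    """
    a, b = set(listing_a), set(listing_b)
    shared = len(a & b)
    if shared == 0:
        return None  # no overlap: the estimate is undefined
    return len(a) * len(b) / shared


def pairwise_estimates(listings):
    """Return {(name_a, name_b): estimate} for every pair of listings."""
    return {
        (name_a, name_b): overlap_estimate(listings[name_a], listings[name_b])
        for name_a, name_b in combinations(sorted(listings), 2)
    }


# Hypothetical toy listings of deep web sites (not data from the paper).
listings = {
    "dir_a": ["s1", "s2", "s3", "s4"],
    "dir_b": ["s3", "s4", "s5", "s6", "s7", "s8"],
    "dir_c": ["s9", "s10"],
}
estimates = pairwise_estimates(listings)
```

If two listings tend to include the same popular sites, their overlap is larger than it would be under independence and the estimate comes out too low, which is one way the bias mentioned above can arise.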
Unlike overlap analysis, the second technique, random sampling of IP addresses (rsIP for short), is easily reproducible and requires no pre-built listings. The rsIP technique estimates the total number of deep web sites by analyzing a sample of unique IP (Internet Protocol) addresses randomly generated from the entire space of valid IPs and extrapolating the findings to the Web at large. Since the entire IP space is of finite size and every web site is hosted on one or several web servers, each with an IP address¹, analyzing an IP sample of adequate size can provide reliable estimates for the characteristics of the Web in question. In [3], one million unique randomly selected IP addresses were scanned for active web servers by making an HTTP connection to each IP. Detected web servers were exhaustively crawled, and those hosting deep web sites (i.e., web sites with at least one search interface to a database) were identified and counted.

¹ An IP address is not a unique identifier for a web server, as a single server may use multiple IPs and, conversely, several servers can answer for the same IP.

2010 12th International Asia-Pacific Web Conference. 978-0-7695-4012-2/10 $26.00 © 2010 IEEE. DOI 10.1109/APWeb.2010.59

Unfortunately, the rsIP approach