Databases on the Web: national web domain survey

by Denis Shestakov
IDEAS 2011

Abstract

The deep Web, the part of the Web consisting of web pages filled with information from myriads of online databases, is to date relatively unexplored. Even its basic characteristics, such as the number of searchable databases on the Web, are disputable. In this paper, we address the problem of accurate estimation of the deep Web by sampling one national web domain. We report some of our results obtained when surveying the Russian Web. The survey findings, namely the size estimates of the deep Web, could be useful for further studies to handle data in the deep Web.


Denis Shestakov
Department of Media Technology, Aalto University
Konemiehentie 2, Espoo, 02150 Finland
denis.shestakov@aalto.fi

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services
H.3.7 [Information Storage and Retrieval]: Digital Libraries - Collection

General Terms
Measurement

Keywords
web databases, deep web, web characterization, web measurement, national web, structured data, cluster random sampling, virtual hosting

1. INTRODUCTION

Dynamic pages generated from parameters that a user supplies via web search forms are poorly indexed by major web search engines and, hence, scarcely present in their results. Such search interfaces provide web users with online access to myriads of databases, the contents of which comprise a huge part of the Web known as the deep Web [17]. Since introducing structured web data to search results is one of the current priorities for web search engines such as Google or Microsoft Bing [12], there is great interest in better understanding deep web resources, the main sources of structured data.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IDEAS11 2011, September 21-23, Lisbon [Portugal]. Editors: Bernardino, Cruz, Desai. Copyright © 2011 ACM 978-1-4503-0627-0/11/09 $10.00

Though the term deep Web was coined in 2000 [9], sufficiently long ago for any web-related concept, many important characteristics of the deep Web still remain unknown. For example, such a parameter as the total number of searchable databases on the Web is highly disputable. In fact, until now only three works (namely, [9, 13, 19]) have been solely devoted to deep web characterization and, moreover, one of them is a white paper whose findings were all obtained using proprietary methods. Another matter of concern is that the mentioned surveys are based on approaches with inherent limitations. The most serious drawback is ignoring so-called virtual hosting, i.e., the fact that multiple web sites can share the same IP address. Neglecting the virtual hosting factor in earlier deep web surveys means that their estimates are highly biased.

In this work, our goal is to obtain accurate characteristics of the deep Web by sampling one national web domain. In our characterization survey we use the Host-IP clustering approach, which is based on the idea of clustering hosts that share the same IPs and analyzing "neighbors by IP" hosts together. Using host-IP mapping data allows us to address the drawbacks of previous surveys, specifically to take into account the virtual hosting factor.
We obtain some rough estimates for the number of entities (or objects) in the analyzed national web domain and argue that the size of the deep Web (measured in the number of searchable entities) is similar to the size of the indexable Web.

The next section gives background on methods to characterize the deep Web. In Section 3 we present our approach, the Host-IP cluster sampling technique. The results of our survey of the Russian Web are described in Section 4. Discussion and literature review are given in Sections 5 and 6, respectively. Finally, Section 7 concludes the paper.

2. BACKGROUND: DEEP WEB CHARACTERIZATION

Existing attempts to characterize the deep Web [9, 13, 19] are based on two methods originally applied to general Web surveys: namely, overlap analysis [10] and random sampling of IP addresses [16].

The first technique involves pairwise comparisons of listings of deep web sites, where the overlap between each pair of sources is used to estimate the size of the deep Web (specifically, the total number of deep web sites) [9]. The critical requirement that the listings be independent of one another is infeasible in practice, which makes the estimates produced by overlap analysis seriously biased. Additionally, the method is generally non-reproducible.
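As an illustration of how an overlap between two listings can yield a size estimate, the sketch below uses the capture-recapture (Lincoln-Petersen) formula, N ≈ |A|·|B| / |A ∩ B|. This is one common formulation of overlap analysis; whether [9] used exactly this formula is an assumption, and the function names and the `listings` structure are hypothetical.

```python
from itertools import combinations


def overlap_estimate(listing_a, listing_b):
    """Capture-recapture estimate of the population size from two listings.

    Assumes the two listings are independent samples of the same population,
    which is exactly the requirement the text calls infeasible in practice.
    Returns None when the listings share no site (estimate undefined).
    """
    a, b = set(listing_a), set(listing_b)
    shared = len(a & b)
    if shared == 0:
        return None
    return len(a) * len(b) / shared


def pairwise_estimates(listings):
    """Apply overlap_estimate to every pair in a {name: sites} mapping."""
    return {
        (first, second): overlap_estimate(listings[first], listings[second])
        for first, second in combinations(sorted(listings), 2)
    }
```

If the listings are positively correlated (for example, both favor popular sites), the overlap is inflated and the estimate is pushed downward, which is the kind of bias described above.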
Figure 1: Random sampling of IP addresses method: a sample of IPs (IP1, ..., IPn) is tested for active web servers (found at IP2 and IPk), which are then checked for the presence of interfaces to web databases. Because not all web sites hosted on a particular IP can be discovered, only three sites in total are analyzed while the rest are missed.
[Figure 1 diagram: HTTP requests are sent to the sampled IPs IP1, ..., IPn; only IP2 and IPk respond. IP2 hosts 20 web sites, of which sites 1 and 3 are analyzed and the rest are missed; IPk hosts 2 web sites, of which site 1 is analyzed and site 2 is missed. The legend distinguishes sites with one or more search interfaces from sites with no search interfaces.]

Unlike overlap analysis, the second technique, random sampling of IP addresses (rsIP for short), is easily reproducible and requires no pre-built listings. rsIP estimates the total number of deep web sites by analyzing a sample of unique IP (Internet Protocol) addresses randomly generated from the entire space of valid IPs and extrapolating the findings to the Web at large. Since the entire IP space is of finite size and every web site is hosted on one or several web servers, each with an IP address (such an address is not a unique identifier for a server, though: a single server may use multiple IPs and, conversely, several servers can answer for the same IP), analyzing an IP sample of adequate size can provide reliable estimates for the characteristics of the Web in question. In [13], one million unique randomly selected IP addresses were scanned for active web servers by making an HTTP connection to each IP.
Detected web servers were exhaustively crawled and those hosting deep web sites (defined as web sites with search interfaces, or search forms, that allow a user to search in underlying databases) were identified and counted. The technique is depicted in Figure 1, where one deep web site is found to be hosted on a web server with IPk. Note that the main indicator of a deep web site is the ability to search through the content of an underlying database (or databases) rather than through the (crawlable) content of the web site's pages. In this way, a deep web site and a database-driven web site are two different notions.

Unfortunately, the rsIP approach has several limitations. The most serious drawback is ignoring virtual hosting, i.e., the fact that multiple web sites can share the same IP address. This leads to ignoring a certain number of sites, some of which are apparently deep web sites. To illustrate, Figure 1 shows that the servers with IP2 and IPk host twenty and two web sites, respectively, but only three out of 22 web sites are actually crawled to discover interfaces to web databases. The numbers of analyzed and missed sites per IP in this example are perfectly typical: the reverse IP procedure usually returns one or two web sites hosted on a given IP address, while hosting many sites on the same IP is common practice. Table 1 presents the average numbers of virtual hosts per IP address obtained in four web studies conducted in 2003-2007. The data clearly suggest that (1) one IP address is, on average, shared by 7-11 hosts and (2) the number of hosts per IP increases over time [4, 14, 3].

Another factor overlooked by the rsIP method is DNS load balancing, i.e., the assignment of multiple IP addresses to a single web site. For instance, the Russian news site newsru.com, mapped to three IPs (here and hereafter, if not otherwise indicated, resolved in 05/2010), is three times more likely to appear in a sample of random IPs than a site with one assigned IP.
Since DNS load balancing is most beneficial for popular and highly trafficked web sites, we expect that the bias caused by load balancing is smaller than the bias due to virtual hosting. Indeed, according to SecuritySpace's survey as of April 2004, only 4.7% of hosts had their names resolved to multiple IP addresses [5], while more than 90% of hosts shared the same IP with others (see the first row of Table 1). To summarize, virtual hosting cannot be ignored in any IP-based sampling survey. Next we present the sampling strategy that addresses these challenges.

3. HOST-IP CLUSTERING TECHNIQUE

Real-world web sites may be hosted on several web servers, often share their web servers with other sites, and are often accessible via multiple hostnames. Neglecting these issues makes estimates produced by IP-based or host-based sampling seriously biased.

The key to a better sampling strategy lies in the fact that hostname aliases for a given web site are frequently mapped to the same IP address. Thus, given a hostname resolved to some IP address, we can identify other hostnames potentially pointing to the same web content by checking the other hostnames mapped to this IP. There is a strong resemblance here to the virtual hosting problem, where all hosts sharing a given IP address have to be found.

Assuming a large listing of hosts is available, we can learn which hosts are mapped to which IPs by resolving all hostnames in the listing to their corresponding IP addresses. Technically, such massive resolving of available hosts to their IPs is essentially a process of clustering hosts into groups, each including the hosts that share the same IP address. Grouping hosts with the same IPs together is quite natural because it is exactly what happens on the Web, where a web server serves requests only to those hosts that are mapped to the server's IP.
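The clustering idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's actual tooling: the function names are hypothetical, resolution uses the standard library's `socket.gethostbyname_ex` (IPv4 only), and a host that resolves to several IPs is simply placed in every matching cluster.

```python
import socket
from collections import defaultdict


def resolve_ips(hostname):
    """Return the IPv4 addresses a hostname resolves to, or [] on failure."""
    try:
        # gethostbyname_ex returns (canonical_name, alias_list, ip_list).
        return socket.gethostbyname_ex(hostname)[2]
    except (OSError, UnicodeError):
        # Unresolvable or malformed hostnames are skipped.
        return []


def cluster_hosts_by_ip(hostnames, resolver=resolve_ips):
    """Group hostnames into clusters keyed by the IP address they share.

    `resolver` is injectable so the grouping can be run on pre-resolved
    host-IP mapping data instead of live DNS lookups.
    """
    clusters = defaultdict(set)
    for host in hostnames:
        for ip in resolver(host):
            clusters[ip].add(host)
    return dict(clusters)
```

For a large host listing, resolving sequentially would be slow; a real run would likely batch or parallelize the lookups, which is outside the scope of this sketch.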
Once the overall list of hosts is clustered by IPs, we can apply a cluster sampling strategy, where an IP address is a primary sampling unit consisting of a cluster of secondary sampling units, hosts.

Our Host-IP approach to characterization of the deep Web consists of the following major steps:

  • Resolving, clustering and sampling: resolve a large number of hosts belonging to the studied web segment to their IP addresses, group hosts by their IPs, and generate a sample of random IP addresses from all resolved IPs.
  • Crawling: for each sampled IP, analyze the hosts sharing it for near-duplicates, remove the near-duplicates and crawl the rest to a predefined depth. While crawling, new hosts (which are not in the initial main list) may be found: those mapped to a sampled IP are to be analyzed; others are analyzed if they belong to the studied web segment.
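To make the cluster sampling idea concrete, the sketch below draws a simple random sample of IPs from the clustered host list and extrapolates a count to all resolved IPs. It uses the textbook one-stage cluster sampling estimator, total ≈ (N / n) · Σ yᵢ, where N is the number of clusters, n the sample size, and yᵢ the count found in sampled cluster i. Treating this as the estimator used in the survey is an assumption. The crawling and near-duplicate removal steps are abstracted into a hypothetical `is_deep_web_site` predicate, and hosts mapped to several IPs are not corrected for.

```python
import random


def sample_ips(clusters, sample_size, seed=None):
    """Draw a simple random sample of IPs (primary sampling units)."""
    rng = random.Random(seed)
    # Sorting first makes the draw reproducible for a given seed.
    return rng.sample(sorted(clusters), sample_size)


def estimate_total(clusters, sampled_ips, is_deep_web_site):
    """Extrapolate the number of deep web sites from the sampled clusters.

    `clusters` maps each IP to the set of hosts sharing it; every host in a
    sampled cluster is examined, so no site on a sampled IP is missed.
    """
    per_ip_counts = [
        sum(1 for host in clusters[ip] if is_deep_web_site(host))
        for ip in sampled_ips
    ]
    return len(clusters) / len(sampled_ips) * sum(per_ip_counts)
```

Because every host in a sampled cluster is inspected, the sites that rsIP would miss on a shared IP are counted here, which is the point of sampling IPs from the host-IP mapping rather than from the raw IP space.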
