User Browsing Behavior-driven Web Crawling
Figure 2: The overview of the proposed approach.
– First, our approach can predict the importance
of unseen URLs. The URL patterns of a website remain stable over
a relatively long period, and the user behaviors associated with
a given pattern do not change much.
– Second, our approach combines the merits of both general
and site-level crawl ordering policies. General policies such as
PageRank work for the majority of websites but cannot
optimize the efficiency for a particular one, while site-level
policies [21, 22] are designed for specific websites but are
hard to scale up to the whole Web. By contrast,
our approach can go deep to optimize site-level crawl
efficiency, as well as go wide to provide a general solution.
Two technical obstacles must be overcome to make the proposed
approach work:
– How to determine the granularity of URL patterns? Overly
general patterns tend to mix URLs with different
characteristics and do not help crawling, while overly
fine-grained patterns risk over-fitting, which leads to poor
generalization to unseen data. We propose to organize
URL patterns in a tree structure, and to select appropriate
patterns from the tree by examining the
corresponding user behaviors.
– How to properly leverage URL patterns to design crawl
ordering policies? URLs from different patterns play different
roles in a website. We propose a behavior graph-based
solution to rank URL patterns for two common crawling
scenarios: comprehensively fetching a website and discovering
new pages in a timely manner.
Extensive experiments have been carried out and the
results are promising. First, the discovered URL
patterns describe user behaviors well; furthermore,
the patterns are temporally stable and
generalize well to unseen data. Second, crawling
efficiency is noticeably improved: under the same download
throughput, the proposed approach discovers and fetches
more informative pages than several traditional methods.
2. DATA AND FRAMEWORK OVERVIEW
An overview of the proposed framework is shown
in Fig. 2. It consists of three main components: (1) log
processing, (2) URL pattern discovery, and (3) pattern
ranking for crawl ordering.
Figure 3: Illustrations of (a) URL decomposition
and (b) the pattern tree (only part of the tree is shown
for clarity) for URLs from www.playlist.com.
We work on browse logs collected from anonymous users
who have agreed to contribute their surfing history. Each record
in the raw log data is a triple
(URLt, URLr, GUID)
where URLt is the target being browsed and URLr is the
referrer the user comes from. GUID is a hexadecimal string that
identifies an anonymous user. Duplicate records (i.e., the same
user transiting from URLr to URLt multiple times) are
removed to avoid bias introduced by individual users. Several
measurements are defined to characterize user
browsing behaviors:
– Fin(u). The frequency of a URL u being a visit target. It
is defined as the number of records whose URLt is u.
– Fout(u). The frequency of a URL u being a referrer. It is
defined as the number of records whose URLr is u.
– Ftrans(u, v). The frequency of transiting from a URL u
to v. It is defined as the number of records whose URLt
is v and URLr is u.
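The three measurements above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `behavior_counts`, the tuple layout `(url_t, url_r, guid)`, and the example.com URLs are assumptions made for this example.

```python
from collections import Counter

def behavior_counts(records):
    """Compute Fin, Fout and Ftrans from (url_t, url_r, guid) triples.

    Hypothetical helper: the record layout mirrors the triple
    (URLt, URLr, GUID) described in the text.
    """
    # Remove duplicate records: the same user making the same
    # transition several times is counted once.
    unique = set(records)
    f_in, f_out, f_trans = Counter(), Counter(), Counter()
    for url_t, url_r, _guid in unique:
        f_in[url_t] += 1               # url_t was a visit target
        f_out[url_r] += 1              # url_r was a referrer
        f_trans[(url_r, url_t)] += 1   # transition url_r -> url_t
    return f_in, f_out, f_trans

# Made-up records for illustration only.
records = [
    ("http://example.com/b", "http://example.com/a", "u1"),
    ("http://example.com/b", "http://example.com/a", "u1"),  # duplicate
    ("http://example.com/b", "http://example.com/a", "u2"),
    ("http://example.com/c", "http://example.com/b", "u1"),
]
f_in, f_out, f_trans = behavior_counts(records)
```

Because deduplication is applied to the full triple, each count is effectively the number of distinct users behind a visit or transition.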
To discover appropriate URL patterns, as shown in Fig. 2,
we first construct a pattern tree to organize URLs based
on their syntactic structure. The “parent-child” relationship
between tree nodes captures syntactic similarity
at various granularities. We then simplify the tree by
merging child nodes into their parent if they exhibit consistent
user behaviors.
Next, we analyze the statistical characteristics of user
behaviors on each URL pattern. For example, how many URLs
in a pattern act as referrers during browsing? And how
frequently do users transit from URLs in one pattern
to URLs in another? From such statistics we can estimate
how users browse a website and, by analogy, design crawl
ordering policies. Currently two crawling scenarios, comprehensive
fetch and timely discovery, are considered, and
their ordering policies correspond to the modules
ranking-for-fetch and ranking-for-discovery in Fig. 2.
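The two example questions can be answered by lifting URL-level transition counts to the pattern level. The following is a minimal sketch under stated assumptions: `pattern_of` is a hypothetical mapping from each URL to the label of its pattern (in practice it would come from the pattern tree), and the labels "list" and "item" are invented for illustration.

```python
from collections import Counter, defaultdict

def pattern_statistics(f_trans, pattern_of):
    """Aggregate URL-level transitions to the pattern level.

    f_trans:    {(u, v): count} transitions from URL u to URL v
    pattern_of: {url: pattern label} (hypothetical mapping)
    """
    referrers = defaultdict(set)   # pattern -> URLs that acted as referrer
    p_trans = Counter()            # (pattern, pattern) -> transition count
    for (u, v), count in f_trans.items():
        pu, pv = pattern_of[u], pattern_of[v]
        referrers[pu].add(u)
        p_trans[(pu, pv)] += count
    n_referrers = {p: len(urls) for p, urls in referrers.items()}
    return n_referrers, p_trans

# Made-up data for illustration only.
pattern_of = {"a": "list", "b": "list", "c": "item", "d": "item"}
f_trans = {("a", "c"): 3, ("b", "c"): 1, ("c", "d"): 2}
n_referrers, p_trans = pattern_statistics(f_trans, pattern_of)
```

Here `n_referrers` answers the first question (how many URLs in a pattern act as referrers) and `p_trans` the second (how often users move from one pattern to another).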
Finally, we would like to emphasize that the proposed
approach can be run in parallel in multiple sandboxes, each
of which handles one website, as shown in Fig. 2. The only
input to a sandbox is the site-level log, so the approach is easy
to deploy on a cluster to handle millions of websites.
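The per-site independence described above can be illustrated with a small sketch. Several choices here are assumptions rather than details from the paper: records are assigned to a site by the host of the target URL, `process_site` is a placeholder for the real per-site pipeline, and a thread pool stands in for a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

def partition_by_site(records):
    """Group (url_t, url_r, guid) records by the host of url_t (assumption)."""
    by_site = defaultdict(list)
    for record in records:
        host = urlsplit(record[0]).netloc
        by_site[host].append(record)
    return dict(by_site)

def process_site(site_records):
    """Placeholder sandbox: returns the number of distinct records.

    A real sandbox would run pattern discovery and ranking here.
    """
    return len(set(site_records))

def run_sandboxes(records, worker=process_site, max_workers=4):
    """Run one independent worker per site and collect results by host."""
    by_site = partition_by_site(records)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(worker, by_site.values())
        return dict(zip(by_site.keys(), results))

# Made-up records for illustration only.
records = [
    ("http://a.example/x", "http://a.example/", "u1"),
    ("http://a.example/y", "http://a.example/x", "u1"),
    ("http://b.example/p", "http://b.example/", "u2"),
]
site_results = run_sandboxes(records)
```

Since each worker receives only its own site's records, the same partitioning would carry over to separate processes or machines.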
3. DISCOVERING URL PATTERNS
In this section, we describe how to discover appropriate
URL patterns from web browse logs.
3.1 Syntax-based Pattern Tree Construction
A URL is not an ordinary string but complies with the
syntax scheme strictly defined in [1]. Based on this syntax
scheme, a URL can be decomposed into a series of “key–value”
pairs, as shown in Fig. 3 (a). In addition, URLs in
the same website usually follow certain design principles.
Specifically, different keys usually have different functions
and play different roles.
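As a rough illustration of such a decomposition, the sketch below splits a URL with Python's standard `urllib.parse` module. The key naming is an assumption made for this example (`scheme`, `authority`, `path_0`, `path_1`, ..., and `query_<name>` for query parameters); the exact keys used in Fig. 3 (a) may differ.

```python
from urllib.parse import urlsplit, parse_qsl

def decompose(url):
    """Decompose a URL into key-value pairs (hypothetical key names)."""
    parts = urlsplit(url)
    pairs = {"scheme": parts.scheme, "authority": parts.netloc}
    # Each non-empty path segment becomes its own positional key.
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments):
        pairs[f"path_{i}"] = segment
    # Each query parameter is keyed by its own name.
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        pairs[f"query_{name}"] = value
    return pairs

# Made-up URL for illustration only.
pairs = decompose("http://www.example.com/music/list.php?id=42&sort=asc")
```

With this representation, two URLs can be compared key by key, which is what allows URLs to be grouped by their values under a particular key.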
Following the recursive split process in [12], we construct
a pattern tree in a top-down manner. That is, we start from
the root (which contains all the input URLs), and iteratively
divide URLs into subgroups according to their values under
a particular key. The selected key in each iteration is the one


