In many research works, topical priorities of unvisited hyperlinks are computed based on linearly integrating topic-relevant similarities of various texts and corresponding weighted factors. However, these weighted factors are determined based on the personal experience, so that these values may make topical priorities of unvisited hyperlinks serious deviations directly. To solve this problem, this paper proposes a novel focused crawler applying the cell-like membrane computing optimization algorithm (CMCFC). The CMCFC regards all weighted factors corresponding to contribution degrees of similarities of various texts as one object, and utilizes evolution regulars and communication regulars in membranes to achieve the optimal object corresponding to the optimal weighted factors, which make the root measure square error (RMS) of priorities of hyperlinks achieve the minimum. Then, it linearly integrates optimal weighted factors and corresponding topical similarities of various texts, which are computed by using a Vector Space Model (VSM), to compute priorities of unvisited hyperlinks. The CMCFC obtains more accurate unvisited URLs' priorities to guide crawlers to collect higher quality web pages. The experimental results indicate that the proposed method improves the performance of focused crawlers by intelligently determining weighted factors. In conclusion, the mentioned approach is effective and significant for focused crawlers. © 2013 Elsevier B.V.
Mendeley saves you time finding and organizing research
Choose a citation style from the tabs below