With the explosive growth of data on the Internet, a standalone vertical crawler can no longer meet the demand for high-performance crawling. Existing distributed vertical crawlers are also hard to customize. To address these problems, this paper proposes a distributed vertical crawler named ChainMR Crawler. We adopt the ChainMapper/ChainReducer model to design each module of the crawler, use Redis to manage URLs, and choose the distributed database HBase to store the key content of web pages. Experimental results show that ChainMR Crawler is 6% more efficient than Nutch for vertical crawling, which achieves the expected effect.
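The chained-module idea can be illustrated with a minimal sketch. This is not the authors' implementation: it only mimics, in plain Python, the pattern in which several small mapper stages run back to back and feed one reduce step. The stage names (`fetch`, `extract_links`, `make_topic_filter`), the `FAKE_WEB` dictionary, and the in-memory `seen` set (standing in for a URL set kept in Redis) are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Record = Tuple[str, str]                      # (key, value) pair passed between stages
Mapper = Callable[[Record], Iterable[Record]]

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB: Dict[str, str] = {
    "http://a.example/": '<a href="http://a.example/p1">1</a> '
                         '<a href="http://b.example/">b</a>',
    "http://a.example/p1": '<a href="http://a.example/">home</a>',
}

def fetch(record: Record) -> Iterable[Record]:
    """Stage 1: turn (url, _) into (url, html); unknown URLs are dropped."""
    url, _ = record
    html = FAKE_WEB.get(url)
    if html is not None:
        yield (url, html)

def extract_links(record: Record) -> Iterable[Record]:
    """Stage 2: emit one (source_url, link) pair per href found in the page."""
    url, html = record
    for link in re.findall(r'href="([^"]+)"', html):
        yield (url, link)

def make_topic_filter(prefix: str) -> Mapper:
    """Stage 3 factory: keep only links inside the vertical (assumed: a URL prefix)."""
    def keep_in_scope(record: Record) -> Iterable[Record]:
        _, link = record
        if link.startswith(prefix):
            yield record
    return keep_in_scope

def chain(*mappers: Mapper) -> Callable[[Iterable[Record]], List[Record]]:
    """Compose mappers so the output of each stage is the input of the next."""
    def run(records: Iterable[Record]) -> List[Record]:
        current = list(records)
        for mapper in mappers:
            current = [out for rec in current for out in mapper(rec)]
        return current
    return run

def reduce_new_urls(records: Iterable[Record], seen: Set[str]) -> List[str]:
    """Reduce step: deduplicate links against the set of already-seen URLs."""
    new_urls: List[str] = []
    for _, link in records:
        if link not in seen:
            seen.add(link)
            new_urls.append(link)
    return new_urls

seen: Set[str] = {"http://a.example/"}        # stand-in for a Redis-backed URL set
pipeline = chain(fetch, extract_links, make_topic_filter("http://a.example/"))
frontier = reduce_new_urls(pipeline([("http://a.example/", "")]), seen)
print(frontier)
```

Because each stage has the same record-in, records-out shape, a stage can be swapped or added (for example, a different topic filter) without touching the others, which is the kind of customization the chained design is meant to make easier.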
CITATION STYLE
Liu, X., & Jin, Z. (2016). ChainMR crawler: A distributed vertical crawler based on mapreduce. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10067 LNCS, pp. 33–39). Springer Verlag. https://doi.org/10.1007/978-3-319-49145-5_4