Coping with noise in a real-world weblog crawler and retrieval system

Abstract

In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to removing noise from blog pages and examine the difficulties encountered when crawling the blogosphere to build such a corpus. We introduce and evaluate a number of enhancements to the original DiffPost approach that increase the robustness of the algorithm. We then extend DiffPost by considering the ratio of anchor text to text, and find that, within the blog context, the time interval between crawls matters more to the successful application of noise-removal algorithms than any additional improvement to the removal algorithm itself. Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
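
The abstract names the anchor-text to text ratio but does not say how it is computed. The following is a minimal Python sketch of one plausible reading: the fraction of a fragment's visible characters that sit inside link elements. The names anchor_text_ratio and looks_like_navigation and the 0.5 threshold are assumptions made for illustration, not the authors' implementation or parameters.

```python
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Counts text characters overall and those nested inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # greater than 0 while inside an <a> element
        self.total_chars = 0
        self.anchor_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore leading and trailing whitespace so layout does not skew counts.
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.anchor_chars += n


def anchor_text_ratio(html_fragment):
    """Return anchor characters / all text characters (0.0 if there is no text)."""
    parser = _LinkTextCounter()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.anchor_chars / parser.total_chars


def looks_like_navigation(html_fragment, threshold=0.5):
    """Hypothetical rule: flag link-dominated fragments as likely noise."""
    return anchor_text_ratio(html_fragment) > threshold
```

Under this reading, a blogroll or navigation menu scores close to 1.0 and a post body with a few inline links scores close to 0.0. The sketch does not special-case script or style content, and the threshold would need tuning against real crawl data.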

Lanagan, J., Ferguson, P., O’Hare, N., & Smeaton, A. F. (2010). Coping with noise in a real-world weblog crawler and retrieval system. In ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (pp. 271–274). https://doi.org/10.1609/icwsm.v4i1.14040
