The problem of Discourse Data Extraction focuses on identifying comments and reviews from social networking websites. Existing approaches for Discourse Data extraction are either template-dependent or they are limited to comment-posting-structure discovery. We are not aware of any technique that extracts the detailed comment information like comment text, commenter and discussion structure from the comment page. In this paper, we present a template-independent two step approach, namely TiDE, which extracts the discourse data such as comments, reviews, posts and structural relationship among them. In the first step, we parse the input comment page to prepare a Document Object Model tree and then find the location of discourse data in the tree using the concept of Path-Strings. The outputs of the first step are Comment Blocks and these Comment Blocks are leveraged in second step to extract the comments, commenter and discussion structure. Experimental studies on 19 well known Discourse websites having different templates show that our Comment Block discovery is more adaptable than the existing posting-structure discovery technique. We are able to extract 97 % of comment-text and 79 % commenter information which is significant compared to state of the art techniques. We also show the usefulness of TiDE by building a news comment crawler.
CITATION STYLE
Barua, J., Patel, D., & Goyal, V. (2015). TiDE: Template-independent discourse data extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9263, pp. 149–162). Springer Verlag. https://doi.org/10.1007/978-3-319-22729-0_12
Mendeley helps you to discover research relevant for your work.