Sample-based XPath ranking for web information extraction

Abstract

Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper targets automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a 'search, search result page, detail page' setup. The approach is a form of wrapper induction: it uses a small and easily obtainable set of sample data to rank XPaths by their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute. © 2013. The authors. Published by Atlantis Press.
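The idea of ranking XPaths against sample data can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the ranking function, so the score used here (the fraction of sample pages on which a path returns exactly the known value) is an assumption, as are the names `candidate_paths`, `extract`, `rank_xpaths` and the toy pages in `PAGES`. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, and therefore assumes well-formed markup and class values without quote characters.

```python
import xml.etree.ElementTree as ET


def candidate_paths(root, value):
    """Yield a simple child-step path to every element whose text equals value."""
    def walk(node, path):
        if (node.text or "").strip() == value:
            yield path
        for child in node:
            step = child.tag
            cls = child.get("class")
            if cls:
                # Assumption: class values contain no quote characters.
                step += f"[@class='{cls}']"
            yield from walk(child, f"{path}/{step}")
    yield from walk(root, ".")


def extract(root, path):
    """Return the stripped text of every element matched by path."""
    return [(el.text or "").strip() for el in root.findall(path)]


def rank_xpaths(samples):
    """Rank candidate paths by the share of samples they extract correctly.

    samples: list of (page_markup, expected_value) pairs.
    Returns a list of (path, score), best score first.
    """
    roots = [(ET.fromstring(page), value) for page, value in samples]
    candidates = set()
    for root, value in roots:
        candidates.update(candidate_paths(root, value))
    scores = {}
    for path in candidates:
        # Hypothetical scoring rule: a hit means the path yields exactly
        # one result and that result equals the known sample value.
        hits = sum(1 for root, value in roots if extract(root, path) == [value])
        scores[path] = hits / len(roots)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy detail pages paired with the known value of the wanted attribute (price).
PAGES = [
    ("<html><body><h1>Widget</h1><span class='price'>10</span>"
     "<span class='stock'>10</span></body></html>", "10"),
    ("<html><body><h1>Gadget</h1><span class='price'>25</span>"
     "<span class='stock'>3</span></body></html>", "25"),
    ("<html><body><h1>Gizmo</h1><span class='price'>7</span>"
     "<span class='stock'>7</span></body></html>", "7"),
]

ranked = rank_xpaths(PAGES)
for path, score in ranked:
    print(f"{score:.2f}  {path}")
```

In this toy data the stock value coincides with the price on two of the three pages, so a single sample could not tell the two paths apart. Adding samples separates them, which is in the spirit of the abstract's observation that a modest number of samples is enough to find a suitable XPath.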

Jundt, O., & Van Keulen, M. (2013). Sample-based XPath ranking for web information extraction. In 8th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2013 - Advances in Intelligent Systems Research (Vol. 32, pp. 187–194). https://doi.org/10.2991/eusflat.2013.27
