Sample-based XPath ranking for web information extraction

Oliver Jundt; Maurice Van Keulen

Conference Proceedings

8th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2013 - Advances in Intelligent Systems Research (2013) 32 187-194

DOI: 10.2991/eusflat.2013.27

Abstract

Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper targets automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a 'search-search result page-detail page' setup. It is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute. © 2013. The authors. Published by Atlantis Press.
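
The abstract does not state the ranking function itself, so the following is only a minimal sketch of the general idea of scoring candidate XPaths against known sample values. Everything in it is an assumption made for illustration: the function name `rank_xpaths`, the toy pages, the candidate list, and the precision-times-recall score are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, samples, candidates):
    """Rank candidate XPaths by how well they extract the sample values.

    Illustrative score (an assumption, not the paper's): precision * recall,
    where recall is the share of samples found among the extracted texts and
    precision is the share of extracted texts that are samples.
    """
    roots = [ET.fromstring(page) for page in pages]
    scores = {}
    for xpath in candidates:
        # Collect the distinct, non-empty text values this XPath selects.
        extracted = set()
        for root in roots:
            for node in root.findall(xpath):
                text = "".join(node.itertext()).strip()
                if text:
                    extracted.add(text)
        hits = sum(1 for sample in samples if sample in extracted)
        recall = hits / len(samples)
        precision = hits / len(extracted) if extracted else 0.0
        scores[xpath] = precision * recall
    # Highest score first; ties keep the candidate order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy detail pages (well-formed XML so the standard-library parser accepts them).
pages = [
    "<html><body><h1>Camera A</h1><span class='price'>199</span>"
    "<span class='sku'>X1</span></body></html>",
    "<html><body><h1>Camera B</h1><span class='price'>249</span>"
    "<span class='sku'>X2</span></body></html>",
]
samples = ["199", "249"]  # known attribute values for the wanted attribute
candidates = [".//h1", ".//span[@class='price']", ".//span"]

ranking = rank_xpaths(pages, samples, candidates)
for xpath, score in ranking:
    print(f"{score:.2f}  {xpath}")
```

In this toy setup the overly broad `.//span` finds every sample but also unrelated values, so it ranks below the specific `.//span[@class='price']`. Real web pages are rarely well-formed XML, so a practical version would need an HTML-tolerant parser with full XPath support.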

Cite

Jundt, O., & Van Keulen, M. (2013). Sample-based XPath ranking for web information extraction. In 8th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2013 - Advances in Intelligent Systems Research (Vol. 32, pp. 187–194). https://doi.org/10.2991/eusflat.2013.27
