Abstract
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract information from the web pages of a specific website. Manually creating and maintaining wrappers is cumbersome and error-prone. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper targets automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a 'search, search result page, detail page' setup. The proposed wrapper induction approach uses a small and easily obtainable set of sample data to rank XPaths by their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute. © 2013. The authors. Published by Atlantis Press.
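The idea of ranking XPaths against sample data can be illustrated with a minimal sketch. It is not the paper's scoring method, which the abstract does not specify. It assumes a simple heuristic: a candidate XPath ranks higher the more known sample values it extracts, with ties broken by the fraction of its extracted values that are samples. The function name `rank_xpaths`, the candidate list, and the two toy pages are all hypothetical, and the pages are well-formed XML so that the standard library's limited XPath support is enough.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, candidates, samples):
    """Rank candidate XPaths by how well their output matches known samples.

    Assumed heuristic (not taken from the paper): sort by the number of
    extracted values found in the sample set, then by the share of
    extracted values that are samples.
    """
    sample_set = {s.strip() for s in samples}
    scored = []
    for xpath in candidates:
        hits = 0
        total = 0
        for page in pages:
            root = ET.fromstring(page)
            for node in root.findall(xpath):
                text = "".join(node.itertext()).strip()
                total += 1
                if text in sample_set:
                    hits += 1
        precision = hits / total if total else 0.0
        scored.append((xpath, hits, precision))
    # Best candidates first: most sample hits, then highest precision.
    scored.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return scored


# Toy detail pages, made up for illustration only.
PAGES = [
    "<html><body><h1>Blade Runner</h1><div class='info'>"
    "<span class='year'>1982</span><span class='dir'>Ridley Scott</span>"
    "</div></body></html>",
    "<html><body><h1>Alien</h1><div class='info'>"
    "<span class='year'>1979</span><span class='dir'>Ridley Scott</span>"
    "</div></body></html>",
]

# Hypothetical candidate XPaths and sample values for a "year" attribute.
CANDIDATES = [".//h1", ".//span", ".//span[@class='year']"]
SAMPLES = ["1982", "1979"]

if __name__ == "__main__":
    for xpath, hits, precision in rank_xpaths(PAGES, CANDIDATES, SAMPLES):
        print(f"{xpath}: hits={hits}, precision={precision:.2f}")
```

In this toy setting the more specific `.//span[@class='year']` outranks the generic `.//span`, because both find the sample values but only the former extracts nothing else. Real pages are rarely well-formed XML, so a practical version would likely need an HTML-tolerant parser.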
Jundt, O., & Van Keulen, M. (2013). Sample-based XPath ranking for web information extraction. In 8th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2013 - Advances in Intelligent Systems Research (Vol. 32, pp. 187–194). https://doi.org/10.2991/eusflat.2013.27