In recent years, much work has been invested in automatically learning wrappers for information extraction from HTML tables and lists. Our research has focused on a system that can learn a wrapper from a single unlabeled page. An essential step is to locate the tabular data within the page, which is not trivial when the structures of the data tuples are similar but not identical. In this paper we describe an algorithm that automatically detects approximately repetitive structures within a single sequence. The algorithm does not rely on any domain knowledge or HTML heuristics, and it can be used to detect repetitive patterns and hence to learn wrappers from a single unlabeled tabular page. © Springer-Verlag Berlin Heidelberg 2004.
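To make the notion of an approximately repetitive structure concrete, the following minimal Python sketch splits a token sequence at each occurrence of a candidate start token and scores how similar adjacent segments are. It is an illustration only and is not the algorithm described in the paper, which the abstract does not specify. The function names `score_start_token` and `best_start_token`, the toy tag sequence, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def score_start_token(tokens, start_token):
    """Split tokens at each occurrence of start_token (hypothetical helper).

    Returns (segments, mean similarity of adjacent segments).
    Tokens before the first occurrence are ignored.
    """
    starts = [i for i, tok in enumerate(tokens) if tok == start_token]
    if len(starts) < 2:
        return [], 0.0  # a single occurrence cannot form a repetition
    bounds = starts + [len(tokens)]
    segments = [tokens[a:b] for a, b in zip(bounds, bounds[1:])]
    ratios = [
        SequenceMatcher(None, x, y, autojunk=False).ratio()
        for x, y in zip(segments, segments[1:])
    ]
    return segments, sum(ratios) / len(ratios)


def best_start_token(tokens):
    """Return (token, segments, score) for the best-scoring start token."""
    best = (None, [], 0.0)
    for tok in sorted(set(tokens)):  # sorted for a deterministic order
        segments, score = score_start_token(tokens, tok)
        if score > best[2]:
            best = (tok, segments, score)
    return best


# Toy sequence: three rows, the second with an extra element,
# so the rows are similar but not identical.
tokens = ["tr", "td", "td", "tr", "td", "a", "td", "tr", "td", "td"]
token, segments, score = best_start_token(tokens)
print(token, len(segments), round(score, 3))
```

On this toy input the sketch selects `tr` as the start of the repeated unit even though the middle row differs from its neighbours. A scoring rule this simple can favour very short segments on real pages, so it should be read as a demonstration of the problem rather than a usable detector.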
CITATION STYLE
Gao, X., Andreae, P., & Collins, R. (2004). Approximately repetitive structure detection for wrapper induction. In Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science) (Vol. 3157, pp. 585–594). Springer Verlag. https://doi.org/10.1007/978-3-540-28633-2_62