Mining semi-structured data by path expressions

Abstract

A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages labelled by a labelling function, the HTML pages are divided into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail of the path. The labels of an element node and a text node are called a tag and a text, respectively. The goal of a mining algorithm is to find an interesting pattern, called an association path, which is a pair of a tag-sequence t and a word-sequence w represented by the word-association pattern [1]. An association path (t, w) agrees with a labelling function on a path p if the following holds: t is a subsequence of the tag-sequence of p and w matches the text of p if and only if p is in a positive example. The importance of such an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and demonstrate the efficiency of this model experimentally.
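The definitions in the abstract can be illustrated with a small Python sketch. It is not the authors' mining algorithm: it only splits well-formed markup into paths and counts the agreement of one candidate association path. All function names and the sample pages are hypothetical. Two simplifying assumptions are made: the pages parse as XML with the standard `xml.etree.ElementTree` module, and the word-association pattern of [1] is reduced to "the words of w occur in the text in this order", with no proximity bound.

```python
import xml.etree.ElementTree as ET


def extract_paths(markup):
    """Return (tag_sequence, text) pairs, one per non-empty text node."""
    root = ET.fromstring(markup)
    paths = []

    def walk(node, tags):
        tags = tags + (node.tag,)
        if node.text and node.text.strip():
            paths.append((tags, node.text.strip()))
        for child in node:
            walk(child, tags)
            # Text that follows a child element belongs to the current node.
            if child.tail and child.tail.strip():
                paths.append((tags, child.tail.strip()))

    walk(root, ())
    return paths


def is_subsequence(pattern, sequence):
    """True if pattern can be obtained from sequence by deleting items."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def matches(assoc_path, path):
    """Assumed matching rule: t is a subsequence of the tags and the
    words of w occur in order in the lower-cased, whitespace-split text."""
    t, w = assoc_path
    tags, text = path
    return is_subsequence(t, tags) and is_subsequence(w, text.lower().split())


def agreement(assoc_path, labelled_paths):
    """Count paths where 'pattern matches' coincides with 'path is positive'."""
    return sum(
        1 for path, positive in labelled_paths
        if matches(assoc_path, path) == positive
    )


# Hypothetical sample pages: one positive example and one negative example.
positive_page = (
    "<html><body><ul><li>cheap flights to Tokyo</li>"
    "<li>hotel deals</li></ul></body></html>"
)
negative_page = "<html><body><p>cheap talk about flights</p></body></html>"

data = [(p, True) for p in extract_paths(positive_page)]
data += [(p, False) for p in extract_paths(negative_page)]

alpha = (("ul", "li"), ("cheap", "flights"))
score = agreement(alpha, data)
```

In this sample, α matches the first positive path, fails to match the second positive path ("hotel deals"), and correctly fails to match the negative path because its tag-sequence has no `ul`, `li` subsequence. It therefore agrees with the labelling on two of the three paths.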

Cite

Taniguchi, K., Sakamoto, H., Arimura, H., Shimozono, S., & Arikawa, S. (2001). Mining semi-structured data by path expressions. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2226, pp. 378–388). Springer Verlag. https://doi.org/10.1007/3-540-45650-3_32
