Abstract
A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages labeled by a labelling function, the HTML pages are divided into a set of paths using the XML parser. A path is a sequence of element nodes and text nodes such that a text node appears in only the tail of the path. The labels of an element node and a text node are called a tag and a text, respectively. The goal of a mining algorithm is to find an interesting pattern, called association path, which is a pair of a tag-sequence t and a word-sequence w represented by the word-association pattern [1]. An association path (t,w) agrees with a labelling function on a path p if t is a subsequence of the tag-sequence of p and w matches with the text of p iff p is in a positive example. The importance of such an associate path α is measured by the agreement of a labelling function on given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and show the efficiency of this model by experiments.
Cite
CITATION STYLE
Taniguchi, K., Sakamoto, H., Arimura, H., Shimozono, S., & Arikawa, S. (2001). Mining semi-structured data by path expressions. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2226, pp. 378–388). Springer Verlag. https://doi.org/10.1007/3-540-45650-3_32
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.