A classifier for semi-structured documents

Abstract

In this paper, we describe a novel text classifier that can effectively cope with structured documents. We report experiments that compare its performance with that of a well-known probabilistic classifier. Our novel classifier can take advantage of the information in the structure of a document that conventional, purely term-based classifiers ignore. Conventional classifiers are mostly based on the vector space model of documents, which views a document simply as an n-dimensional vector of terms. To retain the information in the structure, we have developed a structured vector model, which represents a document with a structured vector whose elements can be either terms or other structured vectors. With this extended model, we have also improved the well-known probabilistic classification method based on the Bernoulli document generation model. Our classifier based on these improvements performs significantly better on pre-classified samples from the web and the US Patent database than the usual classifiers.
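
To make the idea of a structured vector more concrete, the following is a minimal Python sketch, not the authors' implementation. The class `StructuredVector`, the helpers `flatten` and `path_features`, and the `title`/`body` element names are all hypothetical and chosen only for illustration. It shows a vector whose elements are either terms or nested vectors, and contrasts the flat term set a purely term-based model would see with binary features that keep each term's position in the structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class StructuredVector:
    """Hypothetical node: its own terms plus named child vectors."""
    terms: Set[str] = field(default_factory=set)
    children: Dict[str, "StructuredVector"] = field(default_factory=dict)


def flatten(vec: StructuredVector) -> Set[str]:
    """Collapse the structure into a flat term set (structure is lost)."""
    result = set(vec.terms)
    for child in vec.children.values():
        result |= flatten(child)
    return result


def path_features(vec: StructuredVector, prefix: str = "") -> Set[str]:
    """Binary presence features that keep the structural path of each term."""
    feats = {f"{prefix}/{term}" for term in vec.terms}
    for name, child in vec.children.items():
        feats |= path_features(child, f"{prefix}/{name}")
    return feats


# Assumed toy document with two structural elements.
doc = StructuredVector(
    children={
        "title": StructuredVector(terms={"classifier"}),
        "body": StructuredVector(terms={"classifier", "patent"}),
    }
)

flat = flatten(doc)
structured = path_features(doc)
```

In this toy example the flat view cannot tell whether "classifier" occurred in the title or the body, while the path-based features can. How the paper actually estimates probabilities over such a structure is not specified in the abstract, so that part is left out here.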

Cite

Yi, J., & Sundaresan, N. (2000). A classifier for semi-structured documents. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 340–344). Association for Computing Machinery (ACM). https://doi.org/10.1145/347090.347164
