In this paper, we describe a novel text classifier that can effectively cope with structured documents. We report experiments that compare its performance with that of a well-known probabilistic classifier. Our classifier can take advantage of the information in the structure of a document that conventional, purely term-based classifiers ignore. Conventional classifiers are mostly based on the vector space model of documents, which views a document simply as an n-dimensional vector of terms. To retain the information in the structure, we have developed a structured vector model, which represents a document as a structured vector whose elements can be either terms or other structured vectors. With this extended model, we have also improved the well-known probabilistic classification method based on the Bernoulli document generation model. Our classifier based on these improvements performs significantly better than conventional classifiers on pre-classified samples from the web and the US Patent database.
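To make the idea of a structured vector concrete, the following is a minimal sketch, not the authors' implementation. It assumes a recursive node that holds term counts plus nested child vectors. The names `StructuredVector`, `flatten`, and `path_terms` are hypothetical, as are the `patent`, `title`, and `claims` labels. Flattening recovers the conventional flat term vector, while the path-qualified view keeps the same term distinct when it occurs in different parts of the structure.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class StructuredVector:
    """Hypothetical node: elements are terms (with counts) or nested vectors."""
    label: str
    terms: Counter = field(default_factory=Counter)
    children: list[StructuredVector] = field(default_factory=list)


def flatten(vec: StructuredVector) -> Counter:
    """Collapse the structure into a flat bag of terms (the conventional view)."""
    total = Counter(vec.terms)
    for child in vec.children:
        total += flatten(child)
    return total


def path_terms(vec: StructuredVector, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, term) pairs so that structural position is preserved."""
    path = f"{prefix}/{vec.label}"
    for term in vec.terms:
        yield (path, term)
    for child in vec.children:
        yield from path_terms(child, path)


# Illustrative document: a patent with a title and a claims section.
doc = StructuredVector(
    "patent",
    children=[
        StructuredVector("title", Counter(["classifier"])),
        StructuredVector("claims", Counter(["classifier", "vector"])),
    ],
)

flat = flatten(doc)                # structure discarded
structured = set(path_terms(doc))  # structure retained
```

In the flat view the two occurrences of "classifier" merge into a single count, whereas the path-qualified view separates the occurrence in the title from the one in the claims. That distinction is the kind of information a purely term-based representation discards.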
CITATION STYLE
Yi, J., & Sundaresan, N. (2000). A classifier for semi-structured documents. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 340–344). Association for Computing Machinery (ACM). https://doi.org/10.1145/347090.347164