Web genre identification can improve information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. The open-set scenario is more realistic for this task, as web genres evolve over time and it is not feasible to define a universally agreed genre palette. In this work, we propose a novel approach to web genre identification based on distributional features acquired by doc2vec and a recently proposed open-set classification algorithm, the nearest neighbors distance ratio classifier. We report experimental results on a benchmark corpus against a strong baseline and demonstrate that the proposed approach is highly competitive, especially when emphasis is placed on precision.
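The decision rule behind a nearest neighbors distance ratio classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `nndr_predict`, the `UNKNOWN` label, the threshold value of 0.8, and the toy two-dimensional vectors (standing in for doc2vec document embeddings) are all assumptions made for illustration. The idea is to compare the distance to the nearest training sample with the distance to the nearest training sample of a different class, and to reject the document as unknown when the two distances are too similar.

```python
import math

UNKNOWN = "unknown"  # hypothetical label for documents outside all known genres


def nndr_predict(train_vectors, train_labels, query, threshold=0.8):
    """Open-set prediction using a nearest neighbors distance ratio.

    Assumes the training set contains at least two distinct labels.
    The threshold is an illustrative value, not one taken from the paper.
    """
    distances = [math.dist(query, v) for v in train_vectors]

    # Nearest training sample overall and its label.
    nearest = min(range(len(distances)), key=distances.__getitem__)
    nearest_label = train_labels[nearest]

    # Nearest training sample that belongs to a different class.
    others = [i for i, lab in enumerate(train_labels) if lab != nearest_label]
    rival = min(others, key=distances.__getitem__)

    if distances[rival] == 0.0:
        # The query coincides with samples of two classes: ambiguous, reject.
        return UNKNOWN

    ratio = distances[nearest] / distances[rival]
    # A small ratio means the query is much closer to one class than to any other.
    return nearest_label if ratio <= threshold else UNKNOWN


# Toy vectors standing in for document embeddings (assumed data).
train_vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 0.9)]
train_labels = ["blog", "blog", "eshop", "eshop"]

print(nndr_predict(train_vectors, train_labels, (0.05, 0.0)))  # close to the "blog" samples
print(nndr_predict(train_vectors, train_labels, (0.5, 0.5)))   # roughly equidistant from both classes
```

Lowering the threshold in this sketch makes the classifier reject more documents as unknown, which trades recall for precision.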
Pritsos, D., Rocha, A., & Stamatatos, E. (2019). Open-set web genre identification using distributional features and nearest neighbors distance ratio. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11438 LNCS, pp. 3–11). Springer Verlag. https://doi.org/10.1007/978-3-030-15719-7_1