In many natural language processing tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the TF-IDF feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document. This paper proposes a novel distributed vector representation of a document, called DV-LSTM, which is derived from the result of adapting a long short-term memory (LSTM) recurrent neural network language model to the document. DV-LSTM is expected to capture high-level sequential information in a document that other current document representations fail to capture. It was evaluated on document genre classification using the Brown Corpus, the BNC Baby Corpus, and the Penn Treebank dataset. The results show that DV-LSTM significantly outperforms the TF-IDF vector and the paragraph vector (PV-DM) in most cases, and that combining them may further improve classification performance.
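To make the word-order limitation concrete, here is a minimal sketch (using scikit-learn, which is not part of the paper): two documents containing the same words in different orders receive identical TF-IDF vectors.

# Illustrates the word-order blindness of bag-of-words TF-IDF: two documents
# with the same words in different orders get identical vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog bit the man", "the man bit the dog"]
tfidf = TfidfVectorizer().fit(docs)
a, b = tfidf.transform(docs).toarray()
print((a == b).all())  # True: TF-IDF cannot distinguish the two

And a heavily simplified, hypothetical sketch of the general idea behind DV-LSTM, assuming PyTorch and a toy LSTM language model: adapt a copy of a pretrained language model on a single document and read a document vector off the result of the adaptation. The model sizes, adapt_steps, learning rate, and the choice to use the shift in the output-layer weights are illustrative assumptions, not the authors' exact recipe.

# Illustrative sketch (NOT the paper's exact method): derive a document
# vector from the change a document induces in an adapted LSTM LM.
import copy
import torch
import torch.nn as nn

class TinyLSTMLM(nn.Module):
    def __init__(self, vocab_size=1000, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)

def dv_lstm(model, doc_ids, adapt_steps=5, lr=0.1):
    """Adapt a copy of the LM on one document; return the flattened shift
    in the output-layer weights as the document vector (an assumption)."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    x, y = doc_ids[:, :-1], doc_ids[:, 1:]          # next-word prediction
    for _ in range(adapt_steps):
        opt.zero_grad()
        logits = adapted(x)
        loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1)).backward()
        opt.step()
    delta = adapted.out.weight - model.out.weight   # adaptation shift
    return delta.flatten().detach()

model = TinyLSTMLM()
doc = torch.randint(0, 1000, (1, 50))               # toy token ids
vec = dv_lstm(model, doc)
print(vec.shape)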
Citation
Li, W., & Mak, B. (2017). Derivation of document vectors from adaptation of LSTM language model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Vol. 2 (pp. 456–461). Association for Computational Linguistics. https://doi.org/10.18653/v1/e17-2073