Measuring semantic similarity between documents is an important problem because it underlies many applications, such as document summarization, web search, and text analysis. Although many studies have addressed this problem by enriching document vectors with the relatedness of the words involved, performance remains unsatisfactory because of data insufficiency, i.e., sparse and anomalous co-occurrences between words; insufficient data can only yield unreliable word relatedness. In this paper, we propose an effective approach to correcting this unreliable relatedness: throughout the generation of the relatedness, the joint probability of each word co-occurring with itself is kept consistently equal to its occurrence probability. The unreliable relatedness is thus corrected by reference to the occurrence frequencies of the words, which we confirm both theoretically and experimentally. A thorough evaluation on real datasets shows that our approach achieves significant improvements in document clustering over state-of-the-art methods.
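The consistency constraint described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual algorithm: it assumes a hypothetical co-occurrence count matrix and word occurrence counts, pins each word's self joint probability to its occurrence probability, and then row-normalizes to obtain a relatedness matrix.

```python
import numpy as np

def consistency_corrected_relatedness(cooc, occ_counts):
    """Illustrative sketch of the joint-probability consistency idea.

    `cooc` is a hypothetical word-by-word co-occurrence count matrix and
    `occ_counts` the per-word occurrence counts; neither is the paper's API.
    The diagonal P(w, w) is pinned to the occurrence probability P(w), so
    sparse or anomalous self co-occurrence counts cannot distort it.
    """
    occ = occ_counts / occ_counts.sum()        # occurrence probabilities P(w)
    joint = cooc / cooc.sum()                  # empirical joint probabilities P(wi, wj)
    np.fill_diagonal(joint, occ)               # enforce P(w, w) = P(w) throughout
    # turn the corrected joint probabilities into row-normalized
    # relatedness values P(wj | wi)
    relatedness = joint / joint.sum(axis=1, keepdims=True)
    return joint, relatedness
```

Under this sketch, the corrected joint matrix has the required diagonal by construction, and each row of the relatedness matrix sums to one.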
Citation:
Wei, Y., Wei, J., Yang, Z., & Liu, Y. (2016). Joint probability consistent relation analysis for document representation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9642, pp. 517–532). Springer Verlag. https://doi.org/10.1007/978-3-319-32025-0_32