Unsupervised multi-lingual language modeling has gained attraction in the last few years and poly-lingual topic models provide a mechanism to learn aligned document representations. However, training such models require translation-aligned data across languages, which is not always available. Also, in case of short texts like tweets, search queries, etc, the training of topic models continues to be a challenge. In this work, we present a novel strategy of creating a pseudo-parallel dataset followed by training topic models for sponsored search retrieval, that also mitigates the short text challenge. Our data augmentation strategy leverages easily available bipartite click-though graph that allows us to draw similar documents in different languages. The proposed methodology is evaluated on sponsored search system whose performance is measured on correctly matching the user intent, presented via the query, with ads provided by the advertiser. Our experiments substantiate the goodness of the method on EuroParl dataset and live search-engine traffic.
CITATION STYLE
Jain, A. K., Arora, G., & Agrawal, R. (2020). Graph Regularization for Multi-lingual Topic Models. In SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1741–1744). Association for Computing Machinery, Inc. https://doi.org/10.1145/3397271.3401231
Mendeley helps you to discover research relevant for your work.