Learning Discrete Document Representations in Web Search

2Citations
Citations of this article
8Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Product quantization (PQ) has been usually applied to dense retrieval (DR) of documents thanks to its competitive time, memory efficiency and compatibility with other approximate nearest search (ANN) methods. Originally, PQ was learned to minimize the reconstruction loss, i.e., the distortions between the original dense embeddings and the reconstructed embeddings after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause a severe loss of retrieval quality. Recent research has primarily concentrated on jointly training the biencoders and PQ to ensure consistency for improved performance. However, it is still difficult to design an approach that can cope with challenges like discrete representation collapse, mining informative negatives, and deploying effective embedding-based retrieval (EBR) systems in a real search engine. In this paper, we propose a Two-stage Multi-task Joint training technique (TMJ) to learn discrete document representations, which is simple and effective for real-world practical applications. In the first stage, the PQ centroid embeddings are regularized by the dense retrieval loss, which ensures the distinguishability of the quantized vectors and preserves the retrieval quality of dense embeddings. In the second stage, a PQ-oriented sample mining strategy is introduced to explore more informative negatives and further improve the performance. Offline evaluations are performed on a public benchmark (MS MARCO) and two private real-world web search datasets, where our method notably outperforms the SOTA PQ methods both in Recall and Mean Reciprocal Ranking (MRR). Besides, online experiments are conducted to validate that our technique can significantly provide high-quality vector quantization. Moreover, our joint training framework has been successfully applied to a billion-scale web search system.

Cite

CITATION STYLE

APA

Huang, R., Zhang, D., Lu, W., Li, H., Wang, M., Shi, D., … Yin, D. (2023). Learning Discrete Document Representations in Web Search. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 4185–4194). Association for Computing Machinery. https://doi.org/10.1145/3580305.3599854

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free