Learning Discrete Document Representations in Web Search

Rong Huang; Danfeng Zhang; Weixue Lu; Han Li; Meng Wang; Daiting Shi; Jun Fan; Zhicong Cheng; Simiu Gu; Dawei Yin

Conference Proceedings, Open Access

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2023) 4185-4194

DOI: 10.1145/3580305.3599854

Abstract

Product quantization (PQ) is widely applied to dense retrieval (DR) of documents thanks to its competitive time and memory efficiency and its compatibility with other approximate nearest neighbor (ANN) search methods. Originally, PQ was learned to minimize the reconstruction loss, i.e., the distortion between the original dense embeddings and the embeddings reconstructed after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause a severe loss of retrieval quality. Recent research has primarily concentrated on jointly training the bi-encoders and PQ so that the two objectives are consistent and performance improves. However, it is still difficult to design an approach that can cope with challenges such as discrete representation collapse, mining informative negatives, and deploying effective embedding-based retrieval (EBR) systems in a real search engine. In this paper, we propose a Two-stage Multi-task Joint training technique (TMJ) to learn discrete document representations, which is simple and effective for real-world applications. In the first stage, the PQ centroid embeddings are regularized by the dense retrieval loss, which ensures the distinguishability of the quantized vectors and preserves the retrieval quality of the dense embeddings. In the second stage, a PQ-oriented sample mining strategy is introduced to explore more informative negatives and further improve performance. Offline evaluations are performed on a public benchmark (MS MARCO) and two private real-world web search datasets, where our method notably outperforms the SOTA PQ methods in both Recall and Mean Reciprocal Rank (MRR). In addition, online experiments confirm that our technique provides high-quality vector quantization. Moreover, our joint training framework has been successfully applied to a billion-scale web search system.
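To make the reconstruction objective mentioned in the abstract concrete, here is a minimal NumPy sketch of classic product quantization: each embedding is split into M sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and the codebooks are fit by plain k-means, which minimizes only the reconstruction loss. This is not the TMJ method, whose details are not given here. The function names (train_codebooks, encode, decode, reconstruction_loss), the random data, and the sizes M=4 and K=8 are assumptions chosen for illustration.

```python
import numpy as np

def train_codebooks(X, M, K, iters=10, seed=0):
    """Fit one k-means codebook per subspace (plain Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    assert D % M == 0, "embedding dim must be divisible by M"
    d = D // M
    codebooks = np.empty((M, K, d), dtype=X.dtype)
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        # Initialise centroids from K distinct training points.
        cent = sub[rng.choice(N, size=K, replace=False)].copy()
        for _ in range(iters):
            # Squared distances from every sub-vector to every centroid: (N, K).
            dist = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(K):
                members = sub[assign == k]
                if len(members):  # keep the old centroid if a cluster is empty
                    cent[k] = members.mean(0)
        codebooks[m] = cent
    return codebooks

def encode(X, codebooks):
    """Map each embedding to M centroid indices (its discrete code)."""
    M, K, d = codebooks.shape
    codes = np.empty((X.shape[0], M), dtype=np.int64)
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        dist = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
        codes[:, m] = dist.argmin(1)
    return codes

def decode(codes, codebooks):
    """Rebuild approximate embeddings by concatenating the chosen centroids."""
    M = codebooks.shape[0]
    return np.concatenate([codebooks[m][codes[:, m]] for m in range(M)], axis=1)

def reconstruction_loss(X, X_hat):
    """Mean squared L2 distortion between original and reconstructed vectors."""
    return float(((X - X_hat) ** 2).sum(1).mean())

# Synthetic stand-in for document embeddings (assumption: 500 docs, 16 dims).
rng = np.random.default_rng(0)
docs = rng.normal(size=(500, 16)).astype(np.float32)

codebooks = train_codebooks(docs, M=4, K=8)
codes = encode(docs, codebooks)        # shape (500, 4), values in [0, 8)
docs_hat = decode(codes, codebooks)    # shape (500, 16)
loss = reconstruction_loss(docs, docs_hat)
print(codes[0], loss)
```

In this sketch each document is stored as 4 small integers instead of 16 floats. Nothing in the k-means objective refers to queries or relevance labels, which is the inconsistency with the retrieval goal that the abstract points out.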

Cite

Huang, R., Zhang, D., Lu, W., Li, H., Wang, M., Shi, D., … Yin, D. (2023). Learning Discrete Document Representations in Web Search. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 4185–4194). Association for Computing Machinery. https://doi.org/10.1145/3580305.3599854
