Abstract
Pre-trained language models (PLMs) have achieved great success in the area of Information Retrieval. Studies show that applying these models to ad-hoc document ranking can improve retrieval effectiveness. However, on the Web, most information is organized as HTML web pages. In addition to the plain text content, the structure imposed on that content by HTML tags is also an important part of the information a web page delivers. Currently, such structural information is ignored by pre-trained models, which are trained solely on text content. In this paper, we propose to leverage large-scale web pages and their DOM (Document Object Model) tree structures to pre-train models for information retrieval. We argue that the hierarchical structure of web pages provides richer contextual information for training better language models. To exploit this information, we devise four pre-training objectives based on the structure of web pages and pre-train a Transformer model on these tasks jointly with the traditional masked language model objective. Experimental results on two authoritative ad-hoc retrieval datasets show that our model significantly improves ranking performance compared to existing pre-trained models.
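The abstract describes pre-training over DOM tree structure rather than plain text alone. As a minimal illustrative sketch (not the authors' implementation, and not the Webformer model itself), the snippet below shows one simple way such structural context could be extracted: traversing a page's DOM and pairing each text span with the path of HTML tags that encloses it, using only the Python standard library.

```python
# Minimal sketch (assumption, not the paper's code): collect (tag-path, text) pairs
# from an HTML document, giving each text span its hierarchical DOM context.
from html.parser import HTMLParser

class DOMTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, e.g. ["html", "body", "p"]
        self.spans = []   # (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching tag, tolerating unclosed elements
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.spans.append(("/".join(self.stack), text))

html_doc = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"
collector = DOMTextCollector()
collector.feed(html_doc)
for path, text in collector.spans:
    print(path, "->", text)
# html/body/h1 -> Title
# html/body/p -> First
# html/body/p/b -> bold
# html/body/p -> paragraph.
```

Pairs like these could, under the paper's framing, serve as the structural signal that tag-aware pre-training objectives operate on, in contrast to feeding the model flattened text only.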
Citation
Guo, Y., Ma, Z., Mao, J., Qian, H., Zhang, X., Jiang, H., … Dou, Z. (2022). Webformer: Pre-training with Web Pages for Information Retrieval. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1502–1512). Association for Computing Machinery, Inc. https://doi.org/10.1145/3477495.3532086