Abstract
Pre-trained language models (PLMs) have achieved great success in the area of Information Retrieval. Studies show that applying these models to ad-hoc document ranking can improve retrieval effectiveness. However, on the Web, most information is organized as HTML web pages. In addition to the plain text content, the structure imposed on that content by HTML tags is also an important part of the information a web page delivers. Currently, such structural information is ignored by pre-trained models, which are trained solely on text content. In this paper, we propose to leverage large-scale web pages and their DOM (Document Object Model) tree structures to pre-train models for information retrieval. We argue that the hierarchical structure of web pages provides richer contextual information for training better language models. To exploit this information, we devise four pre-training objectives based on the structure of web pages and pre-train a Transformer model on these tasks jointly with the traditional masked language model objective. Experimental results on two authoritative ad-hoc retrieval datasets show that our model significantly improves ranking performance compared to existing pre-trained models.
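The abstract describes pre-training over DOM tree structure rather than plain text alone. As a minimal illustrative sketch (not the authors' implementation, and not the Webformer model itself), the snippet below shows one simple way such structural context could be extracted: traversing a page's DOM and pairing each text span with the path of HTML tags that encloses it, using only the Python standard library.

```python
# Minimal sketch (assumption, not the paper's code): collect (tag-path, text) pairs
# from an HTML document, giving each text span its hierarchical DOM context.
from html.parser import HTMLParser

class DOMTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, e.g. ["html", "body", "p"]
        self.spans = []   # (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching tag, tolerating unclosed elements
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.spans.append(("/".join(self.stack), text))

html_doc = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"
collector = DOMTextCollector()
collector.feed(html_doc)
for path, text in collector.spans:
    print(path, "->", text)
# html/body/h1 -> Title
# html/body/p -> First
# html/body/p/b -> bold
# html/body/p -> paragraph.
```

Pairs like these could, under the paper's framing, serve as the structural signal that tag-aware pre-training objectives operate on, in contrast to feeding the model flattened text only.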
Citation
Guo, Y., Ma, Z., Mao, J., Qian, H., Zhang, X., Jiang, H., … Dou, Z. (2022). Webformer: Pre-training with Web Pages for Information Retrieval. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1502–1512). Association for Computing Machinery, Inc. https://doi.org/10.1145/3477495.3532086