WebKE: Knowledge Extraction from Semi-structured Web with Pre-trained Markup Language Model

Chenhao Xie; Wenhao Huang; Jiaqing Liang; Chengsong Huang; Yanghua Xiao

Conference Proceedings, Open Access

International Conference on Information and Knowledge Management, Proceedings (2021) 2211-2220

DOI: 10.1145/3459637.3482491

12 Citations

23 Readers

Abstract

The World Wide Web contains rich, up-to-date information for knowledge graph construction. However, most current relation extraction techniques are designed for free text and thus do not handle semi-structured web content well. In this paper, we propose a novel multi-phase machine reading framework, called WebKE. It processes web content at different granularities by first detecting areas of interest at the DOM tree node level and then extracting relational triples for each area. We also propose HTMLBERT as an encoder for the web content. It is a pre-trained markup language model that fully leverages the visual layout information and DOM tree structure, without the need for hand-engineered features. Experimental results show that the proposed approach outperforms state-of-the-art methods by a considerable margin. The source code is available at https://github.com/redreamality/webke.

Cite

Xie, C., Huang, W., Liang, J., Huang, C., & Xiao, Y. (2021). WebKE: Knowledge Extraction from Semi-structured Web with Pre-trained Markup Language Model. In International Conference on Information and Knowledge Management, Proceedings (pp. 2211–2220). Association for Computing Machinery. https://doi.org/10.1145/3459637.3482491
