WebDP: Understanding Discourse Structures in Semi-Structured Web Documents

Abstract

Web documents have become rich data resources in the current era, and understanding their discourse structure could benefit various downstream document processing applications. Unfortunately, current discourse analysis and document intelligence research mostly focuses on either the discourse structure of plain text or superficial visual structures in documents, neither of which can accurately describe the discourse structure of highly free-styled, semi-structured web documents. To promote discourse studies on web documents, this paper introduces a benchmark, WebDP, built around a new task named Web Document Discourse Parsing. Specifically, a discourse structure representation schema for web documents is proposed by extending classical discourse theories and adding special features that capture the discourse characteristics of web documents. A manually annotated web document dataset, WEBDOCS, is then developed to facilitate the study of this parsing task. We compare current neural models on WEBDOCS, and the experimental results show that WebDP is feasible but also challenging for current models.

Cite

Liu, P., Lin, H., Liao, M., Xiang, H., Han, X., & Sun, L. (2023). WebDP: Understanding Discourse Structures in Semi-Structured Web Documents. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 10235–10258). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.650