We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods. © 2005 Association for Computational Linguistics.
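To make the idea of a tree that describes hierarchical relations in a document concrete, the following is a minimal sketch of such a structure and a simple way to re-render it as an indented outline. It is an illustration only, not the authors' algorithm: it does not perform the extraction step (no EM or clustering). The names `HeaderNode` and `render`, and the sample page content, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeaderNode:
    """A node in a hypothetical header tree: a header plus the items under it."""
    text: str
    children: List["HeaderNode"] = field(default_factory=list)


def render(node: HeaderNode, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one line per node."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical page content: "Name" and "Contact" act as headers
# for the values nested beneath them.
page = HeaderNode("Profile", [
    HeaderNode("Name", [HeaderNode("Jane Doe")]),
    HeaderNode("Contact", [
        HeaderNode("Email", [HeaderNode("jane@example.org")]),
        HeaderNode("Phone", [HeaderNode("555-0100")]),
    ]),
])

outline = render(page)
print("\n".join(outline))
```

Once the hierarchy is explicit, the same tree could be rendered in other layouts (for example, a flat list of header paths) without revisiting the original page markup.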
Yoshida, M., & Nakagawa, H. (2005). Reformatting web documents via header trees. In ACL-05 - 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 121–124). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1225753.1225784