Charset encoding detection of HTML documents: A practical experience

Abstract

Charset encoding detection is a primary task in various web-based systems, such as web browsers, email clients, and search engines. In this paper, we present a new hybrid technique for charset encoding detection of HTML documents. Our approach consists of two phases: “Markup Elimination” and “Ensemble Classification”. The Markup Elimination phase is based on the hypothesis that charset encoding detection is more accurate when markup is removed from the main content. Therefore, HTML markup and other structural data, such as scripts and styles, are separated from the rendered text of the HTML document using a decoding-encoding trick that preserves the integrity of the byte sequence. In the Ensemble Classification phase, we leverage two well-known charset encoding detection tools, namely Mozilla CharDet and IBM ICU, and combine their outputs based on their estimated domains of expertise. Results show that the proposed technique significantly improves the accuracy of charset encoding detection over both Mozilla CharDet and IBM ICU.
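The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm. The "decoding-encoding trick" is assumed here to be a round trip through ISO-8859-1, which maps every byte to exactly one code point, so the surviving text re-encodes to its original bytes. This only holds for ASCII-compatible encodings, where markup bytes look like ASCII. The names RenderedTextExtractor, strip_markup, ensemble_detect, and trusted_a are hypothetical, and the two detectors are passed in as plain callables rather than as real Mozilla CharDet or IBM ICU bindings. The combination rule (trust the first detector only inside an assumed set of charsets it handles well) is one simple reading of "domain of expertise".

```python
from html.parser import HTMLParser
from typing import Callable, Set


class RenderedTextExtractor(HTMLParser):
    """Collects text nodes, skipping tags, scripts, and styles."""

    def __init__(self) -> None:
        # convert_charrefs=False keeps entity references out of handle_data,
        # so every collected character still stands for one original byte.
        super().__init__(convert_charrefs=False)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_markup(raw: bytes) -> bytes:
    """Phase 1 (assumed form): remove markup without altering text bytes."""
    parser = RenderedTextExtractor()
    # latin-1 decoding never fails and is byte-for-byte reversible.
    parser.feed(raw.decode("latin-1"))
    parser.close()
    return "".join(parser.chunks).encode("latin-1")


def ensemble_detect(
    text: bytes,
    detect_a: Callable[[bytes], str],
    detect_b: Callable[[bytes], str],
    trusted_a: Set[str],
) -> str:
    """Phase 2 (assumed rule): prefer detector A inside its trusted set."""
    guess_a = detect_a(text)
    if guess_a in trusted_a:
        return guess_a
    # Outside A's assumed domain of expertise, fall back to detector B.
    return detect_b(text)
```

In use, the bytes returned by strip_markup would be passed to ensemble_detect together with wrappers around any two detectors. Which charsets belong in trusted_a would have to be estimated from labelled data; the abstract does not list them.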

Cite

Faghani, S., Hadian, A., & Minaei-Bidgoli, B. (2015). Charset encoding detection of HTML documents: A practical experience. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9460, pp. 215–226). Springer Verlag. https://doi.org/10.1007/978-3-319-28940-3_17
