Charset encoding detection of HTML documents: A practical experience

Shabanali Faghani; Ali Hadian; Behrouz Minaei-Bidgoli

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2015), Vol. 9460, pp. 215–226

DOI: 10.1007/978-3-319-28940-3_17

Abstract

Charset encoding detection is a primary task in various web-based systems, such as web browsers, email clients, and search engines. In this paper, we present a new hybrid technique for charset encoding detection of HTML documents. Our approach consists of two phases: “Markup Elimination” and “Ensemble Classification”. The Markup Elimination phase is based on the hypothesis that charset encoding detection is more accurate when markup is removed from the main content. Therefore, HTML markup and other structural data, such as scripts and styles, are separated from the rendered text of the HTML document using a decoding-encoding trick that preserves the integrity of the byte sequence. In the Ensemble Classification phase, we leverage two well-known charset encoding detection tools, namely Mozilla CharDet and IBM ICU, and combine their outputs based on their estimated domains of expertise. Results show that the proposed technique significantly improves the accuracy of charset encoding detection over both Mozilla CharDet and IBM ICU.
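The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the decoding-encoding trick is a round trip through ISO-8859-1, which maps every byte to exactly one code point, so the bytes that survive markup removal are unchanged; this only holds for ASCII-compatible encodings. The names `strip_markup` and `detect_charset`, the regular expressions, and the routing rule (trust the first detector only for labels in `trusted`, otherwise ask the second detector about the markup-free bytes) are hypothetical stand-ins for the "estimated domain of expertise" described in the abstract.

```python
import re
from typing import Callable, FrozenSet

# Assumption: ISO-8859-1 maps each byte 0x00-0xFF to one code point, so
# decode -> edit -> encode returns the kept bytes exactly as they were.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG = re.compile(r"<[^>]*>")


def strip_markup(raw: bytes) -> bytes:
    """Remove comments, scripts, styles, and tags without re-encoding text."""
    text = raw.decode("iso-8859-1")       # lossless: one char per byte
    text = _COMMENT.sub(" ", text)
    text = _SCRIPT_STYLE.sub(" ", text)   # drop the element and its body
    text = _TAG.sub(" ", text)            # drop remaining tags
    return text.encode("iso-8859-1")      # back to the original byte values


# A detector is any callable that maps raw bytes to a charset label.
Detector = Callable[[bytes], str]


def detect_charset(
    raw: bytes,
    first: Detector,
    second: Detector,
    trusted: FrozenSet[str] = frozenset({"UTF-8"}),  # hypothetical choice
) -> str:
    """Hypothetical ensemble rule: accept the first detector's answer only
    for labels it is trusted on; otherwise run the second detector on the
    markup-free bytes."""
    guess = first(raw)
    if guess.upper() in trusted:
        return guess
    return second(strip_markup(raw))
```

In practice, `first` and `second` would be thin wrappers around whichever bindings of the two detection tools are available; the sketch deliberately leaves them as plain callables so that it does not depend on any particular library API.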

Citation (APA)

Faghani, S., Hadian, A., & Minaei-Bidgoli, B. (2015). Charset encoding detection of HTML documents: A practical experience. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9460, pp. 215–226). Springer Verlag. https://doi.org/10.1007/978-3-319-28940-3_17
