Thai related foreign language specific web crawling approach

Tanaphol Suebchua; Bundit Manaskasemsak; Arnon Rungsawang

Conference Proceedings

Lecture Notes in Electrical Engineering (2014) 285 LNEE 641-648

DOI: 10.1007/978-981-4585-18-7_72

Abstract

National web archives have been successfully made available through domain- and language-specific web crawlers for years. We propose a focused web crawler for collecting foreign-language web pages that are also related to a nation. Rather than searching for the most relevant individual web pages, the crawler uses an ensemble machine learning model, trained on selected features, to find relevant clusters of unvisited web pages, called website segments. During consecutive crawling cycles, the model is retrained with features extracted from newly found website segments. Preliminary experiments in the real web space on Thai-tourism-related topics show that this approach can take advantage of recent crawling experience to produce more promising harvest rates than traditional breadth-first and best-first baselines. © Springer Science+Business Media Singapore 2014.
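The abstract refers to two ideas that a short sketch may make more concrete: grouping unvisited URLs into website segments, and measuring a crawl by its harvest rate. The Python below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not define how a segment is delimited, so `segment_of` uses a hypothetical rule (host plus first path component). `harvest_rate` uses the common definition of the metric, the fraction of downloaded pages judged relevant, which the paper may refine. All function names are invented for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def segment_of(url: str) -> str:
    """Return a segment key for a URL.

    Assumption: a segment is the host plus the first path component.
    The paper's actual segment definition is not given in the abstract.
    """
    parts = urlsplit(url)
    first = parts.path.strip("/").split("/")[0]
    return f"{parts.netloc}/{first}" if first else parts.netloc


def group_by_segment(urls):
    """Cluster unvisited URLs into segments so they can be scored as a group."""
    segments = defaultdict(list)
    for url in urls:
        segments[segment_of(url)].append(url)
    return dict(segments)


def harvest_rate(relevance_flags):
    """Fraction of downloaded pages that were judged relevant."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

In a crawler of the kind the abstract describes, a classifier would presumably score each segment returned by `group_by_segment`, and the harvest rate of each crawling cycle would be compared against breadth-first and best-first baselines.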

Citation (APA)

Suebchua, T., Manaskasemsak, B., & Rungsawang, A. (2014). Thai related foreign language specific web crawling approach. In Lecture Notes in Electrical Engineering (Vol. 285 LNEE, pp. 641–648). Springer Verlag. https://doi.org/10.1007/978-981-4585-18-7_72
