National web archives have been successfully made available through domain- and language-specific web crawlers for years. Here we propose another focused web crawler for collecting foreign-language web pages that are also related to a nation. Rather than searching for the most relevant individual web pages, the crawler trains an ensemble machine learning model on selected features to find relevant clusters of unvisited web pages, called website segments. During consecutive crawling cycles, the model is retrained with features extracted from newly found website segments. Preliminary experiments on the real web with Thai tourism-related topics show that this approach can take advantage of recent crawling experience to produce more promising harvest rates than traditional breadth-first and best-first baselines. © Springer Science+Business Media Singapore 2014.
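To make the evaluation terms concrete, the following is a minimal sketch and not the authors' implementation. It assumes the usual focused-crawling definition of harvest rate (relevant pages divided by pages downloaded) and a generic best-first frontier ordered by a predicted relevance score per website segment. The names `harvest_rate` and `SegmentFrontier`, the example URLs, and the scores are all hypothetical; in the paper the scores would come from the trained ensemble model, which is not reproduced here.

```python
import heapq


def harvest_rate(relevance_flags):
    """Fraction of downloaded pages judged relevant (0.0 if none were downloaded)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


class SegmentFrontier:
    """Best-first frontier: always pops the segment with the highest score.

    The scores are assumed to be supplied by some classifier; this class
    only handles the ordering.
    """

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, segment, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._count, segment))
        self._count += 1

    def pop(self):
        neg_score, _, segment = heapq.heappop(self._heap)
        return segment, -neg_score

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = SegmentFrontier()
    frontier.push("example.com/news/", 0.2)    # hypothetical segment and score
    frontier.push("example.com/travel/", 0.9)  # hypothetical segment and score
    print(frontier.pop())
    print(harvest_rate([True, False, True, True]))
```

A breadth-first baseline would replace the priority queue with a plain FIFO queue, so comparing the two strategies reduces to comparing the harvest rates they reach after the same number of downloads.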
CITATION
Suebchua, T., Manaskasemsak, B., & Rungsawang, A. (2014). Thai related foreign language specific web crawling approach. In Lecture Notes in Electrical Engineering (Vol. 285 LNEE, pp. 641–648). Springer Verlag. https://doi.org/10.1007/978-981-4585-18-7_72