Algorithm for clustering of web search results from a hyper-heuristic approach

2Citations
Citations of this article
12Readers
Mendeley users who have this article in their library.
Get full text

Abstract

The clustering of web search results - or web document clustering (WDC) - has become a very interesting research area among academic and scientific communities involved in information retrieval. Systems for the clustering of web search results, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for clustering of web results already exist, but results show there is room for more to be done. This paper introduces a hyper-heuristic framework called WDC-HH, which allows the defining of the best algorithm for WDC. The hyper-heuristic framework uses four high-level-heuristics (performance-based rank selection, tabu selection, random selection and performance-based roulette wheel selection) for selecting low-level heuristics (used to solve the specific problem of WDC). As a low level heuristics the framework considers: harmony search, improved harmony search, novel global harmony search, global-best harmony search, eighteen genetic algorithm variations, particle swarm optimization, artificial bee colony, and differential evolution. The framework uses the k-means algorithm as a local solution improvement strategy and based on the Balanced Bayesian Information Criterion it is able to automatically define the appropriate number of clusters. The framework also uses four acceptance/replacement strategies (replacement heuristics): Replace the worst, Restricted Competition Replacement, Stochastic Replacement and Rank Replacement. WDC-HH was tested with four data sets using a total of 447 queries with their ideal solutions. As a main result of the framework assessment, a new algorithm based on global-best harmony search and rank replacement strategy obtained the best results in WDC problem. This new algorithm was called WDC-HH-BHRK and was also compared against other established WDC algorithms, among them: Suffix Tree Clustering (STC) and Lingo. Results show a considerable improvement -measured by recall, F-measure, fall-out, accuracy and SSLk-over the other algorithms.

Cite

CITATION STYLE

APA

Cobos, C., Duque, A., Bolaños, J., Mendoza, M., & León, E. (2017). Algorithm for clustering of web search results from a hyper-heuristic approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10062 LNAI, pp. 285–316). Springer Verlag. https://doi.org/10.1007/978-3-319-62428-0_24

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free