In biomedical science, data mining techniques have been applied to extract statistically significant and clinically useful information from datasets. Finding biomarker gene sets for diseases can aid in understanding disease diagnosis, prognosis and therapy response. Gene expression microarrays have played an important role in such studies, yet their analysis has also drawn criticism. Because these datasets are feature-rich but case-poor, analyzing them carries a high risk of over-fitting (discovering spurious patterns). This paper describes a GA-SVM hybrid combined with Gaussian noise perturbation (with a manually set noise gain) to combat over-fitting, determine the strongest signal in the dataset, and discover stable biomarker sets. A colon cancer gene expression microarray dataset is used to show that the strongest signal in the data (the optimal noise gain, at which a modest number of similar candidates emerge) can be found by a binary search. The diversity of candidates (measured by cluster analysis) is reduced by the noise perturbation, indicating that some of the patterns are being eliminated (we hope mostly spurious ones). Initial biological validation has been performed, and the genes differ in their significance to the candidates, although the discovered biomarker sets should be studied further to ascertain their biological significance and clinical utility. Furthermore, statistical validation indicates that the strongest signal in the data is spurious and that the discovered biomarker sets should be rejected. © 2011 Published by Elsevier Ltd.
Mathur, R., Schaffer, J. D., Land, W. H., Heine, J. J., Eschrich, S., & Yeatman, T. (2011). Evolutionary computation with noise perturbation and cluster analysis to discover biomarker sets. In Procedia Computer Science (Vol. 6, pp. 153–158). Elsevier B.V. https://doi.org/10.1016/j.procs.2011.08.030