We will review machine learning approaches to genome-wide association studies (GWAS), with a focus on radiogenomics studies. The desire to develop machine learning approaches is motivated by the hypothesis that predictive models are best determined by building (what amounts to) "non-linear voting machines." Unlike standard statistical methods, the individual "voters" do not all need to be validated; instead, the wisdom of the crowd prevails. GWAS correlate a large number (typically ~1 million) of single nucleotide polymorphisms (SNPs) with an observed endpoint. When the endpoints are radiotherapy outcomes, such studies have been referred to as "radiogenomics," but many other endpoints have now been studied with GWAS. Typical GWAS analysis methods have focused on determining the statistical significance of the most highly correlated SNPs. These methods depend on having very large datasets and SNPs with large effect sizes, in an attempt to overcome the statistical noise inherent to extreme tails. Alternatively, some groups have applied machine learning approaches to GWAS analysis. We have developed a multistep machine learning method to build predictive models from GWAS data and modest-sized datasets (hundreds of patients). The method relies on the crucial low-noise property of SNP measurements. The core machine learning step is based on the random forest methodology, which is well suited to genomic biomarkers. The model itself discovers and emphasizes conditional relationships between SNPs through individual decision trees. These models can be further analyzed to understand key biological network sub-components that are critical to the observed endpoint. The overall impact of individual SNPs is ranked through permutation testing, and the resulting ranked list is analyzed using curated network databases to identify key biological interactions and processes.
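To make the random forest and permutation ranking steps concrete, the following is a minimal sketch in plain Python, not the multistep method described above. Everything in it is an illustrative assumption: the simulated genotypes (coded 0, 1, 2), the hypothetical rule that the endpoint occurs only when SNP 0 and SNP 1 both carry a minor allele, and all function names and parameter values. In practice a library implementation of random forests would normally be used instead of hand-written trees.

```python
import random
from collections import Counter

def simulate(n_patients=300, n_snps=12, seed=42):
    """Synthetic data (assumption): endpoint needs a minor allele at SNP 0 AND SNP 1."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(n_snps)] for _ in range(n_patients)]
    y = [1 if row[0] > 0 and row[1] > 0 else 0 for row in X]
    return X, y

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(X, y, rows, rng, depth, n_try):
    """Grow one tree; each split considers a random subset of n_try SNPs."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority  # leaf: predicted class
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        for t in (0, 1):  # genotype thresholds: {0} vs {1,2}, {0,1} vs {2}
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree(X, y, left, rng, depth - 1, n_try),
            build_tree(X, y, right, rng, depth - 1, n_try))

def fit_forest(X, y, n_trees=50, depth=4, seed=0):
    """Fit each tree on a bootstrap resample of the patients."""
    rng = random.Random(seed)
    n_try = max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree(X, y, rows, rng, depth, n_try))
    return forest

def predict_tree(node, row):
    """Follow splits until a leaf (an int class label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def predict_forest(forest, row):
    """Majority vote over the individual trees (the 'voters')."""
    votes = sum(predict_tree(tree, row) for tree in forest)
    return 1 if 2 * votes > len(forest) else 0

def accuracy(forest, X, y):
    return sum(predict_forest(forest, r) == t for r, t in zip(X, y)) / len(y)

def permutation_ranking(forest, X, y, n_repeats=5, seed=1):
    """Rank SNPs by the mean accuracy drop when that SNP's column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(forest, X, y)
    drops = []
    for f in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[f] for row in X]
            rng.shuffle(col)
            Xp = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, col)]
            total += base - accuracy(forest, Xp, y)
        drops.append(total / n_repeats)
    ranking = sorted(range(len(drops)), key=lambda f: drops[f], reverse=True)
    return ranking, drops

X, y = simulate()
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]
forest = fit_forest(X_train, y_train)
ranking, drops = permutation_ranking(forest, X_test, y_test)
print("Top-ranked SNP indices:", ranking[:5])
```

Importance is measured on held-out patients, so a SNP ranks highly only if the forest's predictions genuinely depend on it. Because the simulated endpoint requires both SNPs together, the example also illustrates how individual trees can pick up a conditional relationship between SNPs by splitting on one and then the other.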
We will discuss the process and its application to predicting toxicity following prostate radiotherapy, including erectile dysfunction, late rectal bleeding, and urinary dysfunction. We will also discuss limitations, alternative approaches, and potential applications.
Deasy, J., Lee, S. K., Oh, J. H., Kerns, S., Orstrer, H., & Rosenstein, B. (2018). SP-0484: Machine Learning of radiogenomics SNP GWAS to predict complication risk and to identify key biological correlates. Radiotherapy and Oncology, 127, S249. https://doi.org/10.1016/s0167-8140(18)30794-1