Abstract
Machine learning has been gaining traction in recent years to meet the demand for tools that can efficiently analyze and make sense of the ever-growing databases of biomedical data in health care systems around the world. However, effectively using machine learning methods requires considerable domain expertise, which can be a barrier to entry for bioinformaticians new to computational data science methods. Therefore, off-the-shelf tools that make machine learning more accessible can prove invaluable for bioinformaticians. To this end, we have developed an open-source pipeline optimization tool (TPOT-MDR) that uses genetic programming to automatically design machine learning pipelines for bioinformatics studies. In TPOT-MDR, we implement Multifactor Dimensionality Reduction (MDR) as a feature construction method for modeling higher-order feature interactions, and combine it with a new expert knowledge-guided feature selector for large biomedical data sets. We demonstrate TPOT-MDR's capabilities using a combination of simulated and real-world data sets from human genetics and find that TPOT-MDR significantly outperforms modern machine learning methods such as logistic regression and eXtreme Gradient Boosting (XGBoost). We further analyze the best pipeline discovered by TPOT-MDR for a real-world problem and highlight TPOT-MDR's ability to produce a high-accuracy solution that is also easily interpretable.
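The idea of MDR as a feature construction step can be illustrated with a minimal sketch. This is not TPOT-MDR's code or API: the function names `fit_mdr` and `transform_mdr`, the tie handling (ties count as low risk), and the default label for unseen genotype combinations are all assumptions made for illustration. The sketch collapses several discrete features into one binary feature by labeling each observed value combination high risk (1) when its case-to-control ratio exceeds the ratio in the whole data set, and low risk (0) otherwise.

```python
from collections import defaultdict


def fit_mdr(rows, labels):
    """Learn a high-risk (1) / low-risk (0) label per feature-value combination.

    rows   -- list of tuples, one tuple of discrete feature values per sample
    labels -- list of 0 (control) / 1 (case), same length as rows
    Hypothetical helper for illustration only.
    """
    cases = defaultdict(int)
    controls = defaultdict(int)
    for row, label in zip(rows, labels):
        if label == 1:
            cases[row] += 1
        else:
            controls[row] += 1

    n_cases = sum(labels)
    n_controls = len(labels) - n_cases

    risk = {}
    for cell in set(cases) | set(controls):
        # High risk if cases/controls in this cell exceeds the overall
        # cases/controls ratio. Cross-multiplied to avoid dividing by zero.
        # Assumption: ties are treated as low risk.
        risk[cell] = int(cases[cell] * n_controls > controls[cell] * n_cases)
    return risk


def transform_mdr(risk, rows, default=0):
    """Map each sample to its constructed binary feature.

    Assumption: combinations never seen during fitting get `default`.
    """
    return [risk.get(row, default) for row in rows]


if __name__ == "__main__":
    # XOR-like interaction: neither feature is predictive on its own.
    rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
    labels = [0, 1, 1, 0] * 2
    risk = fit_mdr(rows, labels)
    print(transform_mdr(risk, rows))
```

In the XOR-like example, neither input feature separates cases from controls alone, but the constructed feature does, which is the kind of higher-order interaction the abstract refers to.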
Citation
Sohn, A., Olson, R. S., & Moore, J. H. (2017). Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming. In GECCO 2017 - Proceedings of the 2017 Genetic and Evolutionary Computation Conference (pp. 489–496). Association for Computing Machinery, Inc. https://doi.org/10.1145/3071178.3071212