Summary. Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this study is to develop and evaluate a genetic programming (GP) approach to attribute selection and classification in this domain. We simulated genetic datasets of varying size in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e., epistasis). We show that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then show that including pre-processed estimates of attribute quality using Tuned ReliefF (TuRF) in a multi-objective fitness function that also includes accuracy significantly improves the performance of GP over that of random search. This study demonstrates that GP may be a useful computational discovery tool in this domain. It also raises important questions about the general utility of GP for these types of problems, the importance of data pre-processing, the ideal functional form of the fitness function, and the importance of expert knowledge. We anticipate this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
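To make the two technical ideas in the summary concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an illustrative XOR-like penetrance table for two DNA sequence variations with allele frequency 0.5, and shows that each variation's marginal disease risk is identical across its genotypes, which is what "no independent effects" means here. The table values, the function names `marginal_penetrance` and `weighted_fitness`, and the weighted-sum form with parameter `alpha` are all assumptions introduced for illustration; the summary does not state the functional form of the fitness function used in the study.

```python
# Genotype frequencies under Hardy-Weinberg equilibrium with allele
# frequency 0.5 (an assumption for this sketch): AA, Aa, aa.
GENOTYPE_FREQ = [0.25, 0.5, 0.25]

# Hypothetical penetrance table: P(disease | genotype at SNP A, genotype at SNP B).
# Rows index the genotype at SNP A, columns the genotype at SNP B.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def marginal_penetrance(table, freqs, axis):
    """Disease risk per genotype of one SNP, averaged over the other SNP.

    axis=0 gives the marginals for the row SNP, axis=1 for the column SNP.
    """
    n = len(freqs)
    if axis == 0:
        return [sum(table[i][j] * freqs[j] for j in range(n)) for i in range(n)]
    return [sum(table[i][j] * freqs[i] for i in range(n)) for j in range(n)]


def weighted_fitness(accuracy, attribute_quality, alpha=0.5):
    """Hypothetical multi-objective fitness: a weighted sum of classification
    accuracy and a pre-computed attribute-quality score (for example, a
    normalized TuRF score), both assumed to lie in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * accuracy + (1.0 - alpha) * attribute_quality


if __name__ == "__main__":
    # Both lists contain the same value three times: neither SNP shows
    # an effect on its own, yet the joint table is far from flat.
    print(marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=0))
    print(marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=1))
    print(weighted_fitness(accuracy=0.6, attribute_quality=0.9))
```

Under such a model, a search guided only by accuracy gets no signal from either variation alone, which is consistent with the summary's finding that accuracy-only GP does no better than random search; a quality term in the fitness is one way an external estimate could steer the search toward promising attributes.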
Citation
Moore, J. H., & White, B. C. (2007). Genome-Wide Genetic Analysis Using Genetic Programming: The Critical Need for Expert Knowledge. In Genetic Programming Theory and Practice IV (pp. 11–28). Springer US. https://doi.org/10.1007/978-0-387-49650-4_2