This paper presents a protocol for integrating the two most common types of biological data (clinical and molecular) to classify patients with cancer more effectively. In this protocol, the most informative features are identified using statistical and information-theory-based selection methods for molecular data and the Boruta algorithm for clinical data. Predictive models are built with the Random Forest classification algorithm. Data integration consists of combining the most informative clinical features with synthetic features obtained from genetic marker models and using them as input variables for the classifier. We applied this classification protocol to METABRIC breast cancer samples. Clinical data, gene expression data, and somatic copy number aberration data were used to predict the clinical endpoint. We tested various methods for combining information from the different datasets. Our research shows that both types of molecular data contain features that are relevant for clinical endpoint prediction. The best model was obtained using ten clinical features and two synthetic features derived from biomarker models. In the examined cases, the method used to filter molecular markers had only a small impact on the predictive power of the models, even though the resulting lists of top informative biomarkers diverge.
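The integration step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each synthetic feature is an out-of-fold score from a model trained on one molecular dataset; the abstract does not say how the scores are produced. To stay dependency-free, a nearest-centroid scorer stands in for Random Forest, and the data are random toy values rather than METABRIC. All function and variable names are hypothetical.

```python
import random
from statistics import mean

def fit_centroids(X, y):
    """Stand-in base model: per-class mean vector (NOT the paper's Random Forest)."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [mean(col) for col in zip(*rows)]
    return cents

def score(cents, x):
    """Score in [0, 1]; higher means closer to the class-1 centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, cents[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, cents[1]))
    return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

def out_of_fold_scores(X, y, k=5, seed=0):
    """One synthetic feature per sample, scored by a model that never saw it."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = [0.0] * len(X)
    for f in range(k):
        test = idx[f::k]                      # every k-th shuffled index
        held = set(test)
        train = [i for i in idx if i not in held]
        cents = fit_centroids([X[i] for i in train], [y[i] for i in train])
        for i in test:
            scores[i] = score(cents, X[i])
    return scores

# Toy data (assumption): 60 samples, balanced binary clinical endpoint.
rng = random.Random(42)
n = 60
y = [i % 2 for i in range(n)]

def toy_block(n_features, shift):
    """Gaussian features whose mean is shifted for class 1."""
    return [[rng.gauss(shift * t, 1.0) for _ in range(n_features)] for t in y]

expression = toy_block(8, 0.8)   # stands in for selected gene expression markers
cna = toy_block(8, 0.5)          # stands in for selected copy number markers
clinical = toy_block(3, 0.6)     # stands in for selected clinical features

# Integration: clinical features plus one synthetic feature per molecular dataset.
ge_score = out_of_fold_scores(expression, y)
cna_score = out_of_fold_scores(cna, y)
integrated = [c + [g, a] for c, g, a in zip(clinical, ge_score, cna_score)]
```

In this sketch, `integrated` would then be passed to the final classifier. Computing the scores out of fold keeps each synthetic feature from encoding the label of the sample it describes.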
CITATION STYLE
Polewko-Klim, A., & Rudnicki, W. R. (2020). Data integration strategy for robust classification of biomedical data. In Advances in Intelligent Systems and Computing (Vol. 1160 AISC, pp. 596–606). Springer. https://doi.org/10.1007/978-3-030-45691-7_56