Our goal in this paper is to present an analytical workflow for selecting protein biomarker candidates from SELDI-MS data. The clinical aim is to enable prediction of the duration of complete remission (CR) for acute myeloid leukemia (AML) patients, which would facilitate disease prognosis and make individualized therapy possible. SELDI-MS proteomics analyses were performed on blast cell samples collected from AML patients before chemotherapy. Although the biobank included approximately 200 samples, only 58 were available for analysis. The presented workflow includes sample selection, experimental optimization, repeatability estimation, data preprocessing, data fusion, and feature selection. Specific difficulties were the small number of samples and the skewed distribution of CR duration among the patients. Further, we had to deal with both noisy SELDI-MS data and a diverse patient cohort. These difficulties were handled by sample selection and by applying several methods for data preprocessing and feature detection in the analysis workflow. Four conceptually different methods for peak detection and alignment were considered, as well as two different methods for feature selection. The peak detection and alignment methods were the recently developed annotated regions of significance (ARS) method; the SELDI-MS software Ciphergen Express, which was regarded as the standard method; segment-wise spectral alignment by a genetic algorithm (PAGA) followed by binning; and, finally, binning of raw data. For feature selection, the "standard" Mann-Whitney U test was compared with a hierarchical orthogonal partial least-squares (O-PLS) analysis approach. The combined information from all these analyses gave a collection of 21 protein peaks. These were regarded as the most promising and robust biomarker candidates, since they were selected as significant features in several of the models.
The chosen peaks will be our first choice for the continuing work on protein identification and biological validation. The identification will be performed by chromatographic purification and MALDI MS/MS. Thus, we have shown that combining several data handling methods can improve a protein profiling workflow from experimental optimization to a predictive model. The framework of this methodology should be seen as general and could be applied to one-dimensional spectral omics data other than SELDI-MS, provided that an adequate number of samples is available.
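
As an illustration of the simplest route through the workflow (binning of raw data followed by a per-feature Mann-Whitney U test), a minimal sketch is given below. It is not the implementation used in this study. The function names, the fixed bin width, the two-group split of patients, and the use of a normal approximation without tie or continuity correction are all assumptions made for the example.

```python
from math import erfc, sqrt


def bin_spectrum(mz, intensity, lo, hi, width):
    """Sum intensities into fixed-width m/z bins covering [lo, hi)."""
    n_bins = int((hi - lo) // width)
    bins = [0.0] * n_bins
    for m, y in zip(mz, intensity):
        idx = int((m - lo) // width)
        if 0 <= idx < n_bins:  # ignore points outside the binned range
            bins[idx] += y
    return bins


def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Assumption: no tie correction and no continuity correction,
    so p-values are only approximate for small or heavily tied samples.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = min(erfc(abs(z) / sqrt(2)), 1.0)
    return u, p


def rank_bins(group_a, group_b):
    """Test every bin between two patient groups; sort by p-value.

    group_a, group_b: lists of binned spectra (one list per sample).
    Returns a list of (p_value, bin_index, U) tuples.
    """
    results = []
    for b in range(len(group_a[0])):
        u, p = mann_whitney_u([s[b] for s in group_a],
                              [s[b] for s in group_b])
        results.append((p, b, u))
    return sorted(results)
```

In such a sketch, the two groups might correspond to patients with short and long CR duration; bins near the top of the returned list would be candidate features. With many bins tested at once, the p-values would need a multiple-testing adjustment before being interpreted.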