Prediction of DNA-binding residues from protein sequence information using random forests

109Citations
Citations of this article
50Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Background: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundreds of protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data. Results: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labelled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures. Conclusion: The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF http://bioinfo.ggc.org/bindn-rf/ has thus been developed to make the RF classifier accessible to the biological research community. © 2009 Wang et al; licensee BioMed Central Ltd.

Cite

CITATION STYLE

APA

Wang, L., Yang, M. Q., & Yang, J. Y. (2009). Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics, 10(SUPPL. 1). https://doi.org/10.1186/1471-2164-10-S1-S1

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free