Motivation: Trans-kingdom protein clustering has remained difficult because of the large sequence divergence between eukaryotes and prokaryotes and the presence of transit sequences in organellar proteins. Large-scale protein clustering that includes such divergent organisms needs a heuristic that efficiently selects similar proteins by setting a proper homolog threshold for each protein. Here, a method is described that uses two similarity measures and organism count. Results: The Gclust software constructs minimal homolog groups from all-against-all BLASTP results by single-linkage clustering. Major points include (i) estimation of the domain structure of proteins; (ii) exclusion of multi-domain proteins; (iii) explicit consideration of transit peptides; and (iv) heuristic estimation of a similarity threshold for homologs of each protein by an entropy-optimized organism count method. The resultant clusters were evaluated in the light of the power law. The software was used to construct protein clusters for up to 95 organisms. © The Author 2009. Published by Oxford University Press. All rights reserved.
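The single-linkage step named in the abstract can be illustrated with a minimal sketch. This is not the Gclust implementation: it assumes the all-against-all hits are already available as (query, subject, E-value) tuples and applies one fixed global cutoff, whereas Gclust sets an individual threshold for each protein. The function name `single_linkage_clusters` and the parameter `max_evalue` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def single_linkage_clusters(hits, max_evalue=1e-5):
    """Group protein IDs linked by any hit with E-value <= max_evalue.

    Assumption: `hits` is an iterable of (query, subject, evalue) tuples.
    A single global cutoff is used here for simplicity; it stands in for
    the per-protein thresholds described in the abstract.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, subject, evalue in hits:
        # Register both proteins so unlinked ones still form singletons.
        find(query)
        find(subject)
        if evalue <= max_evalue:
            union(query, subject)

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)

    # Return clusters as sorted lists, ordered by their first member.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    example_hits = [
        ("A", "B", 1e-30),
        ("B", "C", 1e-10),
        ("C", "D", 0.5),   # too weak, does not link C and D
        ("D", "E", 1e-8),
    ]
    print(single_linkage_clusters(example_hits))
```

In this sketch, A and C end up in the same cluster even though they share no direct hit, because single linkage only requires a chain of qualifying hits. That chaining behaviour is one reason multi-domain proteins, which can bridge unrelated families, need special handling.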
CITATION STYLE
Sato, N. (2009). Gclust: Trans-kingdom classification of proteins using automatic individual threshold setting. Bioinformatics, 25(5), 599–605. https://doi.org/10.1093/bioinformatics/btp047