Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing


This article is free to access.

Abstract

Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that must be clustered efficiently, at increased speed and with constant or increased accuracy.

Results: We argue that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.

Conclusion: FORCE is a viable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets, is available at http://gi.cebitec.uni-bielefeld.de/comet/force/. © 2007 Wittkop et al; licensee BioMed Central Ltd.
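To make the weighted cluster editing objective concrete, the following Python sketch scores a candidate clustering under the usual formulation of the model: pairs whose similarity exceeds a threshold count as edges, and the cost of a clustering is the total weight of edges cut between clusters plus the total weight of missing edges added inside clusters. This is not the FORCE heuristic itself, which is implemented in Java and searches for a low-cost clustering. It only evaluates the cost function. The function name, the dictionary-based input, the treatment of unlisted pairs as similarity 0, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def cluster_editing_cost(similarity, threshold, clusters):
    """Cost of editing the thresholded similarity graph into the given clusters.

    similarity: dict mapping an unordered pair (u, v) to a similarity score
                (hypothetical input format; unlisted pairs are treated as 0).
    threshold:  pairs scoring above it are edges, pairs below it are non-edges.
    clusters:   list of disjoint sets covering all objects.
    """
    label = {obj: idx for idx, members in enumerate(clusters) for obj in members}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        raw = similarity.get((u, v), similarity.get((v, u), 0.0))
        weight = raw - threshold
        same_cluster = label[u] == label[v]
        if weight > 0 and not same_cluster:
            cost += weight       # an existing edge must be deleted
        elif weight < 0 and same_cluster:
            cost += -weight      # a missing edge must be inserted
    return cost


# Toy data (invented): four objects with pairwise similarity scores.
toy_similarity = {
    ("a", "b"): 10, ("b", "c"): 8, ("a", "c"): 2,
    ("c", "d"): 6, ("a", "d"): 0, ("b", "d"): 0,
}

cost_abc_d = cluster_editing_cost(toy_similarity, 5, [{"a", "b", "c"}, {"d"}])
cost_ab_cd = cluster_editing_cost(toy_similarity, 5, [{"a", "b"}, {"c", "d"}])
print(cost_abc_d, cost_ab_cd)
```

On this toy input the split {a, b}, {c, d} is cheaper than {a, b, c}, {d}, because cutting the b-c edge costs less than adding a-c and cutting c-d together. A heuristic for the model aims to find such low-cost partitions without enumerating all of them.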

Figures

  • Table 1: Evaluation of protein clustering tools. The F-measure (between 0 and 1) measures the agreement between a clustering resulting from a given algorithm and a reference clustering provided with the dataset. An F-measure of 1 indicates perfect agreement. ASTRAL95_1_161 and ASTRAL95_2_161 refer to the two datasets of SCOP v1.61 used by Paccanaro et al. for spectral clustering [7]. All reported values, except for our algorithm FORCE and for Affinity Propagation, are from the same reference.
  • Table 2: Evaluation of the WGCEP model. The best F-measures for each dataset and each similarity function. ASTRAL95_1_161 and ASTRAL95_2_161 are as in Table 1. ASTRAL95_1_171 and ASTRAL95_2_171 refer to the updated ASTRAL95 data of SCOP v1.71. BeH and SoH denote the similarity functions, while the coverage factor f represents the influence of the coverage on the similarity.
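The captions above do not spell out how the F-measure is computed. As an illustration only, the Python sketch below implements one commonly used clustering F-measure: each reference cluster is matched to the predicted cluster that gives the highest harmonic mean of precision and recall, and the per-cluster scores are averaged weighted by reference cluster size. Whether the tables use exactly this variant is an assumption, and the function name and sample data are hypothetical.

```python
def clustering_f_measure(reference, predicted):
    """Size-weighted best-match F-measure between two clusterings.

    reference, predicted: lists of disjoint sets over the same objects.
    Returns a value in [0, 1]; 1 means the clusterings agree perfectly.
    NOTE: this is one common definition, assumed here for illustration.
    """
    total_objects = sum(len(cluster) for cluster in reference)
    weighted_sum = 0.0
    for ref in reference:
        best = 0.0
        for pred in predicted:
            overlap = len(ref & pred)
            if overlap:
                # 2PR/(P+R) with P = overlap/|pred| and R = overlap/|ref|
                best = max(best, 2.0 * overlap / (len(ref) + len(pred)))
        weighted_sum += len(ref) * best
    return weighted_sum / total_objects


# Invented example: five objects, reference vs. a slightly different clustering.
reference_clusters = [{1, 2, 3}, {4, 5}]
predicted_clusters = [{1, 2}, {3, 4, 5}]
score = clustering_f_measure(reference_clusters, predicted_clusters)
print(round(score, 3))
```

Identical clusterings score 1 under this definition, which matches the statement in Table 1 that an F-measure of 1 indicates perfect agreement.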


Cite

Wittkop, T., Baumbach, J., Lobo, F. P., & Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8. https://doi.org/10.1186/1471-2105-8-396
