© 2006 Nature Publishing Group

Proteome survey reveals modularity of the yeast cell machinery

Anne-Claude Gavin1*†, Patrick Aloy2*, Paola Grandi1, Roland Krause1,3, Markus Boesche1, Martina Marzioch1, Christina Rau1, Lars Juhl Jensen2, Sonja Bastuck1, Birgit Dümpelfeld1, Angela Edelmann1, Marie-Anne Heurtier1, Verena Hoffman1, Christian Hoefert1, Karin Klein1, Manuela Hudak1, Anne-Marie Michon1, Malgorzata Schelder1, Markus Schirle1, Marita Remor1, Tatjana Rudi1, Sean Hooper2, Andreas Bauer1, Tewis Bouwmeester1, Georg Casari1, Gerard Drewes1, Gitte Neubauer1, Jens M. Rick1, Bernhard Kuster1, Peer Bork2, Robert B. Russell2 & Giulio Superti-Furga1,4

Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. Here we report the first genome-wide screen for complexes in an organism, budding yeast, using affinity purification and mass spectrometry. Through systematic tagging of open reading frames (ORFs), the majority of complexes were purified several times, suggesting screen saturation. The richness of the data set enabled a de novo characterization of the composition and organization of the cellular machinery. The ensemble of cellular proteins partitions into 491 complexes, of which 257 are novel, that differentially combine with additional attachment proteins or protein modules to enable a diversification of potential functions. Support for this modular organization of the proteome comes from integration with available data on expression, localization, function, evolutionary conservation, protein structure and binary interactions. This study provides the largest collection of physically determined eukaryotic cellular machines so far and a platform for biological data integration and modelling.

Genomes are remarkable in that they encode most of the functions necessary for their interpretation and propagation1.
However, many principles as to how individual gene products form the structures required for biological activity are still unknown. Biological processes, such as the cell cycle and replication, require precise organization of molecules in time and space. Complexes are among the fundamental units of macromolecular organization2. They are thought to assemble in a particular order, and often require energy-driven conformational changes, specific post-translational modifications or chaperone assistance for proper formation3. Their composition is also known to vary according to cellular requirements.

Affinity purification methods are well suited for studying complexes under near-physiological conditions4,5. They allow macromolecules physically associated with a tagged bait to be retrieved and identified by mass spectrometry6,7. These methods have been applied as large-scale screens in prokaryotic and eukaryotic cells, and have led to a growing collection of cellular machines8–11 that, in combination with large-scale yeast two-hybrid studies12,13, are powerful integrators of additional biological data14–16. However, in the absence of a genome-wide screen, where many complexes are retrieved repeatedly through a ‘reverse purification’ process, assignment of a component to a particular complex relied heavily on experimental stringency and arbitrary thresholds. Here we report the first genome-wide screen for complexes to investigate the underlying organizational principles of the eukaryotic cellular machinery.

Genome-wide characterization of complexes

We applied the tandem-affinity-purification method coupled to mass spectrometry (TAP–MS)6–8 to all 6,466 ORFs of Saccharomyces cerevisiae as annotated in 2002 (refs 17, 18; Fig. 1 and Supplementary Information). We employed standardized protocols and successfully purified 1,993 unique TAP-fusion proteins, of which 88% retrieved at least one partner (Fig. 1; Supplementary Table S1).
From all purifications, we processed 52,000 samples for mass spectrometry and identified 36,000 proteins, of which 2,760 were distinct (Fig. 1; Supplementary Figs S2–S5). These represent about 60% of the estimated proteome for exponentially growing yeast19–21, and cover all functional classes and subcellular localizations. The absolute abundances of the identified proteins show a wide range, from 32 to 500,000 copies per cell19, although coverage varied considerably, being highest for the most abundant proteins (>16,000 copies per cell: 80% coverage), and lowest for the rarest proteins (<500 copies: 40% coverage) (Supplementary Fig. S1). We measured reproducibility by performing 139 purifications in duplicate (99 soluble, 40 membrane), and found that, on average, 69% of recovered proteins were common to both, giving an approximation of false-positive/negative rates within the raw data. However, as complexes are retrieved in several purifications, interactions observed repeatedly are more likely to be correct (see below).

1Cellzome AG, Meyerhofstrasse 1, 69117 Heidelberg, Germany. 2EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany. 3MPI-MG, MPI-IB, Charité Campus Mitte, Schumannstrasse 21/22, 10117 Berlin, Germany. 4Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 19, 1090 Vienna, Austria. †Present address: EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany. *These authors contributed equally to this work.
Nature | Vol 440 | 30 March 2006 | doi:10.1038/nature04532

The purification data contain 73% of known complexes from the Munich Information Center for Protein Sequences (MIPS) database22 (217 complexes) and our own literature mining (62 complexes). We found no evidence for 74 known complexes, possibly because they may not assemble under our growth conditions or because the tag interferes with complex assembly8. This is the case for the partially recovered CCT (chaperonin-containing tailless complex
polypeptide 1) complex: the carboxy termini of the eight subunits in the ring-like core of the complex lie on interaction interfaces23. However, these situations could often be rescued: 30% of TAP-tagged proteins that we could not purify were detected in purifications using other complex components.

We used a modified purification procedure for membrane proteins and successfully purified 340 of the 628 that were tagged. For example, we retrieved the Q/t-SNARE complex, including both integral membrane components of the trimeric receptor (Use1, Sec20 and Ufe1) and the peripheral membrane machinery (Dsl1, Sec39, Tip20) required for stability24. We also detected novel links such as that between the Akr1 palmitoyl transferase (a six-transmembrane-segment protein) and Ste4 (the Gβ subunit of the pheromone receptor-coupled G protein), which is consistent with genetic evidence25 and supports a role for protein acylation in the pheromone response.

De novo definition of protein complexes

The proportion of new proteins identified per purification dropped asymptotically during the progression of the screen, suggesting that the screen was near saturation (Supplementary Fig. S6a). We also observed that 64% of known complexes22 were retrieved several times, resulting in a high coverage of known components (Supplementary Fig. S6b). We exploited this redundancy to define complexes computationally. Current approaches for defining complexes from binary interactions26 were not deemed appropriate because binary interactions cannot be inferred directly from purifications. We also explicitly avoided the incorporation of prior knowledge to circumvent any bias towards well-studied proteins.

We first derived a ‘socio-affinity’ index (see the Methods) that quantifies the propensity of proteins to form partnerships.
It measures the log-odds of the number of times two proteins are observed together, relative to what would be expected from their frequency in the data set, and encompasses both the ‘spoke’ and the ‘matrix’ models for assigning binary interactions within purifications. The index accounts for the frequency of proteins within the data set and thus naturally discriminates true from spurious interactions involving very promiscuous partners. For instance, Vma2, which was seen in 552 purifications and would have been ignored under previous high-frequency filtering strategies8,9, showed high indices only with proteins it is known to associate with (Vma5, Vma6, Vma10 and Rav1). Generally, pairs with socio-affinity indices below 5 should be considered with caution (reproducibility <70%), whereas those above 5 are more reliable (89%). These indices capture some biochemical properties of protein–protein interactions: there is a tentative correlation with the few dissociation constants available in the literature (P < 0.08), and protein pairs with high socio-affinity indices are more likely to be in direct contact as measured either by three-dimensional structures or the yeast two-hybrid system (Supplementary Fig. S7). To our knowledge, this is the first attempt to re-create numbers approximating physical measurements purely from proteomics data.

If each protein only belonged to a single complex, we could generate a definitive set by a single clustering step using socio-affinity indices. However, it is well established that proteins can be present in multiple complexes, a property we reasoned could be captured by an iterative procedure. Briefly, we first used the socio-affinity indices to form a matrix for all pairs of proteins studied, and then applied cluster analysis to generate an initial list of complexes. We then subtracted a penalty from the initial matrix values and repeated clustering.
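As an illustration only, the two ideas described above, a frequency-corrected co-occurrence score and a penalize-and-recluster loop, can be sketched in Python. This is not the index defined in the Methods: `cooccurrence_log_odds` is a simplified, single-term log-odds that pools bait and preys, whereas the socio-affinity index combines spoke and matrix terms. The greedy average-linkage `cluster` function, the choice to penalize only the pairs that were just clustered together, and the parameter names `cutoff`, `penalty` and `rounds` are all assumptions made for this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_log_odds(purifications):
    """Simplified log-odds co-occurrence score (NOT the paper's exact index).

    purifications: list of sets of protein names (bait and preys pooled).
    Returns {frozenset({a, b}): ln(observed / expected)} for every pair
    seen together at least once.
    """
    n = len(purifications)
    seen = Counter(p for pur in purifications for p in pur)
    together = Counter(
        frozenset(pair)
        for pur in purifications
        for pair in combinations(sorted(pur), 2)
    )
    scores = {}
    for pair, observed in together.items():
        a, b = tuple(pair)
        expected = seen[a] * seen[b] / n  # chance co-occurrences if independent
        scores[pair] = math.log(observed / expected)
    return scores


def cluster(scores, proteins, cutoff):
    """Greedy average-linkage clustering (an assumed stand-in method).

    Repeatedly merges the two clusters with the highest mean pairwise score,
    stopping when no mean exceeds `cutoff`. Pairs never observed together
    count as -inf, so they block a merge.
    """
    def linkage(c1, c2):
        vals = [scores.get(frozenset((a, b)), -math.inf) for a in c1 for b in c2]
        return sum(vals) / len(vals)

    clusters = [frozenset([p]) for p in sorted(proteins)]
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda cc: linkage(*cc))
        if linkage(*best) <= cutoff:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return {c for c in clusters if len(c) > 1}


def iterative_complexes(scores, cutoff, penalty, rounds):
    """Cluster, penalize the pairs just clustered together, and repeat.

    Returns every multi-protein cluster seen in any round, so one protein
    can appear in several candidate complexes.
    """
    proteins = {p for pair in scores for p in pair}
    current = dict(scores)  # working copy; the input is left untouched
    found = set()
    for _ in range(rounds):
        new = cluster(current, proteins, cutoff)
        found |= new
        for c in new:
            for pair in combinations(sorted(c), 2):
                key = frozenset(pair)
                if key in current:
                    current[key] -= penalty
    return found
```

In this simplified score, a protein present in every purification scores exactly zero against all partners, which mirrors how a frequency-aware index can discount promiscuous proteins without a hard filter. With toy scores in which a protein S scores 2.5 against A and B but 2.0 against C and D (A–B and C–D both 4.0), a call with `cutoff=1.0`, `penalty=1.0` and `rounds=2` groups S with A and B in the first round and with C and D in the second.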
Tight associations are not drastically affected by the penalty, while looser ones are gradually eroded, and can be replaced by others not present initially.

Figure 1 | Synopsis of the genome-wide screen for complexes and data analysis. a, Summary of the overall experimental strategy. MIPS/SGD, Munich Information Center for Protein Sequences/Saccharomyces Genome Database. b, Definition and terminology used to define protein-complex architecture.

We varied the clustering parameters (number of iterations, clustering type, penalty values, and so on) over a sensible range to produce 1,784 different complex sets, and compared each to a manually curated group of known complexes used for structural analysis14. We computed both coverage (that is, the fraction of proteins in known complexes that we retrieved) and accuracy (that is, the fraction of the retrieved complex components that match those already known; Fig. 1). The best conditions generated a collection of 491 complexes with 83% coverage and 78% accuracy. However, inspection revealed that known complex