Qscore: An Algorithm for Evaluating SEQUEST Database Search Results

Roger E. Moore, Mary K. Young, and Terry D. Lee
Division of Immunology, Beckman Research Institute of the City of Hope, Duarte, California, USA

A scoring procedure is described for measuring the quality of the results for protein identifications obtained from spectral matching of MS/MS data using the Sequest database search program. The scoring system is essentially probabilistic and operates by estimating the probability that a protein identification has come about by chance. The probability is based on the number of identified peptides from the protein, the total number of identified peptides, and the fraction of distinct tryptic peptides from the database that are present in the identified protein. The score is not strictly a probability, as it also incorporates information about the quality of the individual peptide matches. The result of using Qscore on a large test set of data was similar to that achieved using approaches that validate individual spectral matches, with only a narrow overlap in scores between identified proteins and false positive matches. In direct comparison with a published method of evaluating Sequest results, Qscore was able to identify an equivalent number of proteins without any identifiable false positive assignments. Qscore greatly reduces the number of Sequest protein identifications that have to be validated manually. (J Am Soc Mass Spectrom 2002, 13, 378–386) © 2002 American Society for Mass Spectrometry

Proteolytic digestion followed by mass spectrometry database searching has become the premier approach to sensitive identification of proteins. High throughput approaches to protein identification depend on minimizing human time investment in this analysis.
A variety of techniques, including robotic gel band excision and digestion, automated matrix-assisted laser desorption/ionization (MALDI) spotting, autosampled nano-liquid chromatography tandem mass spectrometry (LC/MS/MS) analysis, and automated database searching, have been developed to further this aim. To make automated database searching possible, it is necessary to use a search program that can process spectra without need for human interpretation. In theory, it is also desirable to have an automated scheme to determine the significance and reliability of the database search results.

One of the more popular routines for database matching of peptide MS/MS spectra is Sequest [1]. Sequest can be used to analyze uninterpreted MS/MS spectra and provides a score for each match. However, exact standards for an individual peptide match to be considered significant, or for a group of peptide matches to indicate successful protein identification, remain an open question. One criterion [2, 3] is what might be termed a golden match standard, in which a single, human-validated peptide match is considered to be a conclusive identification of its precursor protein.

While the golden match standard appears valid, it suffers from some difficulties associated with the incompleteness of protein databases. In our experience, it is sometimes the case that the second best match for a peptide would meet numerical and subjective criteria as a golden match if the top scoring peptide were removed from the database. This suggests that the same may be true of some peptides that generate top scores: they are not actually the correct peptide, and generate the top match only because the correct peptide is not in the database. More generally, the golden match criterion is caught in a double bind. In order for a peptide to identify a protein uniquely from the database, it must be sufficiently long that it is unlikely to exist in several unrelated proteins.
For a peptide of such length, though, there must be many isobaric peptides that are not present in the database, so the status of the identified peptide as the unique best match cannot be confirmed.

If single matches, no matter how good, are insufficient to ensure a correct protein identification, some approach based on multiplicity of matches must be used. Some authors have proposed standards for identification based on multiple peptide matches [4], but like the golden match standard, these are basically ad hoc criteria. This paper attempts to create a reasonable, statistically based algorithm for determining the goodness of protein matches.

Published online February 21, 2002.
Address reprint requests to Dr. T. D. Lee, Division of Immunology, Beckman Research Institute of the City of Hope, 1450 E. Duarte Road, Duarte, CA 91010. E-mail: tdlee@coh.org
© 2002 American Society for Mass Spectrometry. Published by Elsevier Science Inc. 1044-0305/02/$20.00. PII S1044-0305(02)00352-5.
Received July 19, 2001. Revised January 16, 2002. Accepted January 16, 2002.
The expected number of matches can be established by a number of approaches. The simplest approach is to derive a prediction analytically for the generic case. Given:

N = number of individual searches
M = number of matches against a specific protein
P = number of proteins in the database

The chance that a group of M searches will all match the same protein (the first search may match any protein, and each of the remaining M − 1 must then match that same protein with probability 1/P) is simply:

$$P_{\text{match}}(M, P) = P^{1-M} \tag{1}$$

And the chance that they will not all match is:

$$P_{\text{no match}}(M, P) = 1 - P^{1-M} \tag{2}$$

The number of groups of M searches chosen from N searches is:

$$N_{\text{groups}}(M, N) = \frac{N!}{(N-M)!\,M!} \tag{3}$$

The chance that no group of M matches will all match to the same protein is then:

$$P_{\text{no match}}(M, N, P) = \left(1 - P^{1-M}\right)^{N!/[(N-M)!\,M!]} \tag{4}$$

Making the chance that there is such a match:

$$P_{\text{match}}(M, N, P) = 1 - \left(1 - P^{1-M}\right)^{N!/[(N-M)!\,M!]} \tag{5}$$

The expected number of matches can also be estimated as the number of groups that can generate a match times the chance that each will generate a match:

$$N_{\text{match}}(M, N, P) = P^{1-M}\,\frac{N!}{(N-M)!\,M!} \tag{6}$$

It is important to note that for $P_{\text{match}}(M, N, P) \ll 1$ (i.e., when the chance of a false positive is low), $P_{\text{match}}(M, N, P) \approx N_{\text{match}}(M, N, P)$. $N_{\text{match}}(M, N, P)$ will tend to overestimate the number of matches when $N_{\text{match}}(M+1, N, P)$ is of order 1 or larger, as each match to M + 1 spectra is treated as M + 1 matches of M spectra.

The results of these formulas can be experimentally tested by searching a group of real spectra against a deliberately falsified database. This was carried out for the two data sets described against both the rn and tsaey (sequence reversed) databases.

Effective Database Size

The results of database searching (Table 1) show the trends expected from the formulas. The number of matches tends to go up as the number of spectra searched increases and down as the number of proteins in the database increases. The actual numbers of proteins matched were not exactly as expected. The search of the tsaey database resulted in a few more matches than predicted, while that for the rn database gave many more than predicted.
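The predicted values reported in Table 1 can be reproduced from eq 6 with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function names `p_match_group`, `n_match`, and `p_match` are hypothetical, and the only inputs are the search counts (511 and 4316) and database sizes (6298 and 483730 proteins) stated for Table 1. Like the formulas themselves, it assumes every search is equally likely to match any protein.

```python
from math import comb, expm1, log1p


def p_match_group(m, p):
    """Eq 1: chance that one group of m searches all match the same protein."""
    return p ** (1 - m)


def n_match(m, n, p):
    """Eq 6: expected number of groups of m searches (out of n) sharing a protein."""
    return comb(n, m) * p_match_group(m, p)


def p_match(m, n, p):
    """Eq 5: chance that at least one group of m searches shares a protein.

    Computed as 1 - (1 - q)**k using log1p/expm1 so that very small
    probabilities are not lost to floating-point rounding.
    """
    q = p_match_group(m, p)
    k = comb(n, m)
    return -expm1(k * log1p(-q))


# Values taken from the Table 1 caption and footnotes.
SEARCHES = {"data set 1": 511, "data set 2": 4316}
DATABASES = {"tsaey": 6298, "rn": 483730}

if __name__ == "__main__":
    for m in (2, 3):
        for set_name, n in SEARCHES.items():
            for db_name, p in DATABASES.items():
                print(
                    f"M>={m} {set_name:10s} {db_name:5s} "
                    f"N_match={n_match(m, n, p):.3g} "
                    f"P_match={p_match(m, n, p):.3g}"
                )
```

Under these assumptions, `n_match(2, 511, 6298)` evaluates to about 20.7 and `n_match(3, 511, 483730)` to about 9.4 × 10^-5, in line with the predicted entries of Table 1. For the latter case `p_match` and `n_match` agree closely, which illustrates the statement that the two quantities coincide when the chance of a false positive is low.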
In effect, the tsaey database behaves as though it is somewhat smaller than its actual size and the rn database behaves as though it is much smaller than its actual size.

Table 1. Results of searching different data sets against sequence reversed databases. Data shown are the actual and predicted number of matches generating at least 2 or 3 unique peptide sequences, and the largest number of unique peptide sequences for any match.

| Peptides required for a protein match | | Data set 1 (511 searches): tsaey database (b) | Data set 1 (511 searches): rn database (c) | Data set 2 (4316 searches): tsaey database (b) | Data set 2 (4316 searches): rn database (c) |
|---|---|---|---|---|---|
| 2 or more (a) | Actual | 27 | 2 | 1034 | 118 |
| | Predicted | 20.7 | 0.27 | 1479 | 19.2 |
| 3 or more | Actual | 1 | 0 | 402 | 4 |
| | Predicted | 0.56 | 9.4 × 10^-5 | 338 | 5.7 × 10^-2 |
| Largest | Actual | 3 | 2 | 10 | 3 |
| | Predicted | 3 | 1 | 6 | 2 |

All entries are numbers of protein matches, except in the "Largest" rows, which give the largest number of unique peptide sequences for any match.
(a) The predicted values for 2 or more peptides/protein are actually overestimates, as each group of 3 matches is treated as 3 independent groups of 2 matches.
(b) The tsaey database contains 6298 proteins.
(c) The rn database contains 483730 proteins.

Several factors appear to account for the discrepancy. A single search can result in matches to more than one protein. Short peptides can exist in several unrelated proteins simply because there are a limited number of possible sequences, while longer peptides may exist in several homologous proteins. Furthermore, some peptides, while of different sequence, are not distinguishable by mass spectrometry. Frequently this occurs because they contain isobaric amino acid substitutions, such as leucine for isoleucine or glutamine for lysine. Occasionally peptides with greater sequence differences will not be distinguished because fragments that could distinguish them are not observed. The presence of multiple identical or indistinguishable peptides in the database reduces its effective size.

Another critical factor is that peptides are not evenly distributed among all proteins. Large proteins may contain hundreds of peptides that may be matched in a