It is important to understand both what a classification metric expresses and what it hides. Last month we examined the use of logistic regression for classification, in which the class of a data point is predicted given training data¹. This month, we look at how to evaluate classifier performance on a test set: data that were not used for training and for which the true classification is known. Classifiers are commonly evaluated using either a numeric metric, such as accuracy, or a graphical representation of performance, such as a receiver operating characteristic (ROC) curve. We will examine some common classifier metrics and discuss the pitfalls of relying on a single metric.

Metrics help us understand how a classifier performs; many are available, some with numerous tunable parameters. Understanding metrics is also critical for evaluating reports by others: if a study presents a single metric, one might question how the classifier would perform when evaluated using other metrics. To illustrate the process of choosing a metric, we will simulate a hypothetical diagnostic test that classifies a patient as having or not having a deadly disease on the basis of multiple clinical factors. In evaluating the classifier, we consider only the results of the test; neither the underlying mechanism of classification nor the underlying clinical factors are relevant.

Classification metrics are calculated from true positives (TPs), false positives (FPs), false negatives (FNs) and true negatives (TNs), all of which are tabulated in the so-called confusion matrix (Fig. 1). The relevance of each of these four quantities depends on the purpose of the classifier and motivates the choice of metric. For a medical test that determines whether patients receive a treatment that is cheap, safe and effective, FPs would not be as important as FNs, which would represent patients who might suffer without adequate treatment. In contrast, if the treatment were an experimental drug, a very conservative test with few FPs would be required to avoid testing the drug on unaffected individuals.

In Figure 2 we show three classification scenarios for each of four different metrics: accuracy, sensitivity, precision and F1. In each panel, all of the scenarios have the same value (0.8) of the given metric. Accuracy is the fraction of predictions that are correct. Although this metric is easy to interpret, high accuracy does not necessarily characterize a good classifier; for instance, it tells us nothing about whether FNs or FPs are more common (Fig. 2a). If the disease is rare, predicting that all subjects are negative offers high accuracy but is not useful for diagnosis. A useful measure for understanding FNs is sensitivity (also called recall or the true positive rate), which is the proportion of known positives that are predicted correctly. However, neither TNs nor FPs affect this metric, and a classifier that simply predicts that all data points are positive has high sensitivity (Fig. 2b). Specificity, which measures the fraction of actual negatives that are correctly predicted, suffers from a similar weakness: it accounts for neither FNs nor TPs. Both TPs and FPs are captured by precision (also called the positive predictive value), which is the proportion of predicted positives that are correct. However, precision captures neither TNs nor FNs (Fig. 2c).
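To make these definitions concrete, the following short Python sketch (our illustration, not code from the column; the function name and example counts are hypothetical) computes accuracy, sensitivity, specificity and precision from the four cells of a confusion matrix.

def basic_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall), specificity and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / total,   # fraction of all predictions that are correct
        "sensitivity": tp / (tp + fn),      # fraction of known positives predicted positive
        "specificity": tn / (tn + fp),      # fraction of known negatives predicted negative
        "precision":   tp / (tp + fp),      # fraction of predicted positives that are correct
    }

# Hypothetical test of 100 subjects, 10 of whom truly have the disease.
print(basic_metrics(tp=8, fp=2, fn=2, tn=88))
# {'accuracy': 0.96, 'sensitivity': 0.8, 'specificity': 0.977..., 'precision': 0.8}

The rare-disease pitfall described above shows up immediately in such a calculation: predicting every subject negative (tp = 0, fp = 0, fn = 10, tn = 90) still gives accuracy 0.90, even though sensitivity is 0 and precision is undefined.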
A very conservative test that predicts the disease for only one subject (the case that is most certain) has a perfect precision score, even though it misses any other affected subjects whose diagnosis is less certain. Ideally, a medical test should have very low numbers of both FNs and FPs. Individuals who do not have the disease should not be given unnecessary treatment or be burdened with the stress of a positive result, and those who do have the disease should not be given false optimism about being disease free.

Several aggregate metrics that more completely summarize the confusion matrix have been proposed for classification evaluation. The most popular is the Fβ score, which uses the parameter β to control the balance of recall and precision and is defined as Fβ = (1 + β²) × (Precision × Recall)/(β² × Precision + Recall). As β decreases, precision is given greater weight. With β = 1, we have the commonly used F1 score, which balances recall and precision equally and reduces to the simpler expression 2TP/(2TP + FP + FN). The Fβ score does not capture the full confusion matrix because it is based on recall and precision, neither of which uses TNs, which might be important for tests of very prevalent diseases. One approach that captures all the data in the confusion matrix is the Matthews correlation coefficient (MCC), which ranges from –1 (when the classification is always wrong) to 0 (when it is no better than random) to 1 (when it is always correct). It should be noted that in a comparison of the results of two classifiers, one classifier may appear superior by one metric yet inferior by another.

Figure 2 | The same value of a metric can correspond to very different classifier performance. (a–d) Each panel shows three different classification scenarios with a table of the corresponding values of accuracy (ac), sensitivity (sn), precision (pr), F1 score (F1) and Matthews correlation coefficient (MCC). Scenarios in a group have the same value (0.8) for the metric shown in bold in each table: (a) accuracy, (b) sensitivity (recall), (c) precision and (d) F1 score. In each panel, observations that do not contribute to the corresponding metric are struck through with a red line. The color-coding is the same as in Figure 1; for example, blue circles (cases known to be positive) on a gray background (predicted to be negative) are FNs.
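Returning to the aggregate metrics defined above, a companion Python sketch (again ours and purely illustrative; the function names and counts are not from the column) computes Fβ and the MCC from the same four counts. With β = 1, f_beta reproduces 2TP/(2TP + FP + FN), while the MCC uses all four cells of the confusion matrix, including the TNs that Fβ ignores.

import math

def f_beta(tp, fp, fn, beta=1.0):
    """Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient: -1 (always wrong) to 1 (always correct), 0 ~ random."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally defined as 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(f_beta(tp=8, fp=2, fn=2))          # 0.8, identical to 2*8 / (2*8 + 2 + 2)
print(mcc(tp=8, fp=2, fn=2, tn=88))      # ~0.78; unlike Fβ, this value changes if TNs change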