SPECIAL TOPIC

Inference without significance: measuring support for hypotheses rather than rejecting them

Tim Gerrodette
NOAA National Marine Fisheries Service, Southwest Fisheries Science Center, La Jolla, CA, USA

Keywords
Bayesian; likelihood; null hypothesis; Phocoena sinus; significance test; statistical inference.

Correspondence
Tim Gerrodette, NOAA National Marine Fisheries Service, Southwest Fisheries Science Center, La Jolla, CA, USA.
E-mail: firstname.lastname@example.org

Accepted: 20 May 2011

doi:10.1111/j.1439-0485.2011.00466.x

Abstract

Despite more than half a century of criticism, significance testing continues to be used commonly by ecologists. Significance tests are widely misused and misunderstood, and even when properly used, they are not very informative for most ecological data. Problems of misuse and misinterpretation include: (i) invalid logic; (ii) rote use; (iii) equating statistical significance with biological importance; (iv) regarding the P-value as the probability that the null hypothesis is true; (v) regarding the P-value as a measure of effect size; and (vi) regarding the P-value as a measure of evidence. Significance tests are poorly suited for inference because they pose the wrong question. In addition, most null hypotheses in ecology are point hypotheses already known to be false, so whether they are rejected or not provides little additional understanding. Ecological data rarely fit the controlled experimental setting for which significance tests were developed. More satisfactory methods of inference assess the degree of support which data provide for hypotheses, measured in terms of information theory (model-based inference), likelihood ratios (likelihood inference) or probability (Bayesian inference). Modern statistical methods allow multiple data sets to be combined into a single likelihood framework, avoiding the loss of information that can occur when data are analyzed in separate steps. Inference based on significance testing is compared with model-based, likelihood and Bayesian inference using data on an endangered porpoise, Phocoena sinus. All of the alternatives lead to greater understanding and improved inference than provided by a P-value and the associated statement of statistical significance.

Marine Ecology. ISSN 0173-9565. Marine Ecology 32 (2011) 404-418. Published 2011. This article is a US Government work and is in the public domain in the USA.

'Null hypothesis testing in the statistical sciences is like protoplasm in biology: they both served an early purpose but are no longer very useful' (Anderson 2008).

Introduction

David Anderson's comment is part of a long history of criticism of null hypothesis significance testing (NHST) by statisticians and statistically minded biologists. Some colorful comments about NHST procedures are that they are 'not a contribution to science' (Savage 1957), 'a serious impediment to the interpretation of data' (Skipper et al. 1967), 'worse than irrelevant' (Nelder 1985), 'difficult to take seriously' (Chernoff 1986), and 'completely devoid of practical utility' (Finney 1989). The long and intense criticism does not seem to have had much effect in ecology. Although use has declined slightly (Fidler et al. 2006; Hobbs & Hilborn 2006), NHST and its associated P-value are currently used in over 90% of papers in ecology and evolution (Stephens et al. 2006).

NHST is based on positing that a certain condition (the null hypothesis) is true and then calculating the probability (the P-value) of the observed data, or of
unobserved data more extreme, given the hypothesis and the probability model.1 The basic idea is that an improbable outcome (small P) is reasonable cause to question the validity of the null hypothesis. The assumption of a null hypothesis leads to the burden-of-proof issue, because the null hypothesis remains the accepted condition unless and until data indicate that it should be rejected. In the context of conservation or wildlife management, where data are often limited, the requirement to disprove a null hypothesis of no effect or no impact can have non-precautionary implications (Peterman & M'Gonigle 1992; Taylor & Gerrodette 1993; Dayton 1998; Brosi & Biber 2009).

As currently used, NHST is a combination of ideas developed in the 1920s and 1930s, primarily by Fisher (1925) and Neyman & Pearson (1933). Actually, these and other statisticians had substantially different views about the nature and role of statistics in the scientific process. There were vigorous disagreements at the time (Goodman 1993; Inman 1994; Salsburg 2001) and it is doubtful that any of them would approve of NHST as it is practiced today. Fisher's idea was that the P-value was an 'aid to judgment' about the truth of a hypothesis. A small P-value meant that the data did not support the hypothesis, but Fisher was not dogmatic about a 0.05 cut-off for significance (see Hurlbert & Lombardi 2009 for changes in Fisher's thinking), nor did he view the outcome of any single experiment as decisive. Neyman and Pearson, on the other hand, explicitly framed the problem as a decision between two competing hypotheses. Fisher's P-value was a flexible measure of evidence, whereas the Neyman-Pearson test was a rule for behavior which would minimize the rate (or frequency, hence the term 'frequentist') of incorrect decisions. The modern hybrid NHST combines these ideas by identifying Fisher's P with the Neyman-Pearson Type I error rate α.
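The mechanics of a P-value can be sketched with a toy calculation (all numbers hypothetical): under a point-null hypothesis that a coin is fair, the P-value for observing 9 heads in 10 tosses is the probability, given the null, of the observed data or of unobserved data more extreme.

```python
# Toy P-value calculation (hypothetical data): 9 heads in 10 tosses,
# null hypothesis: the coin is fair (p = 0.5).
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_tosses, n_heads, p_null = 10, 9, 0.5

# The P-value sums the probabilities, under the null hypothesis, of the
# observed outcome and of all outcomes at least as improbable (two-sided).
p_observed = binom_pmf(n_heads, n_tosses, p_null)
p_value = sum(binom_pmf(k, n_tosses, p_null)
              for k in range(n_tosses + 1)
              if binom_pmf(k, n_tosses, p_null) <= p_observed)

print(p_value)  # 22/1024 ≈ 0.0215
```

By Fisher's reasoning, the small P is an aid to judgment casting doubt on the coin's fairness; by Neyman-Pearson reasoning, P < α = 0.05 triggers the decision to reject. Neither is the probability that the coin is fair.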
The two methods are fundamentally incompatible, and the result is 'a mishmash of Fisher and Neyman-Pearson, with invalid Bayesian interpretation' (Cohen 1994). Their combination 'has obscured the important differences between Neyman and Fisher on the nature of the scientific method and inhibited our understanding of the philosophic implications of the basic methods in use today' (Goodman 1993).

This paper makes three points: (i) that NHST is widely misused and misunderstood; (ii) that even when properly used, NHST is only marginally informative for most ecological data; and (iii) that better methods of inference are available. The first two points are covered relatively briefly, as the problems with NHST have long been well described.2 However, many of the papers are in the statistical, medical and social science literature and may not be familiar to ecologists. An excellent paper on 'the insignificance of significance testing' for ecologists is Johnson (1999) (see also Yoccoz 1991; McBride et al. 1993; Ellison 1996; Cherry 1998; Germano 1999; Anderson et al. 2000; Läärä 2009). The third point is illustrated by working through a specific example, showing that alternatives to NHST can give greater insight and understanding of data. As in any branch of science, new and improved statistical methods are constantly being developed. Ecologists would not use 80-year-old genetic or physiological techniques when more powerful and useful methods are available. Why don't we apply the same standards when drawing conclusions from our data?

Misuse and misunderstanding of NHST

Probably the most pervasive and serious misuse of NHST is to interpret a P-value as the probability that the null hypothesis is true. A small P-value, particularly P < 0.05, is taken to mean that the null hypothesis is false, or at least likely to be false. A variant of this misinterpretation is to regard a non-significant result as confirmation of the null hypothesis.
Thus, after finding that P > 0.05, a common conclusion is something like 'There is no difference' or 'There is no effect'. Another variant, when there is a clear alternative hypothesis, is to interpret 1 − P as the probability that the alternative hypothesis is true. Yet another variant is that if the null hypothesis is rejected, the theory or idea that motivated the test must be true.

1 This paper primarily addresses a point-null hypothesis, which posits the strict equality of a parameter (e.g. the mean) among the groups tested. A point-null hypothesis is the most common form of NHST in the ecological literature. Alternatives such as interval and one-sided tests, which use a similar inferential procedure but posit non-null hypotheses, are discussed briefly later. Despite the misnomer, NHST as used here refers generally to inference conditioned on a hypothesis.

2 David Anderson maintains two lists of hundreds of quotations and citations critical of null hypothesis testing, one compiled through 1997 by Marks Nester (http://warnercnr.colostate.edu/~anderson/nester.html) and another compiled through 2001 by Bill Thompson (http://warnercnr.colostate.edu/~anderson/thompson1.html). An updated list through 2010 can be found at http://swfsc.noaa.gov/SignificanceTestRefs. For discussions of informed use of NHST, see Cox (1977), Harlow et al. (1997), Nickerson (2000), Guthery et al. (2001), Robinson & Wainer (2002), McBride (2005), Stephens et al. (2006) and Martínez del Rio et al. (2007).
In one form or another, all of these misuses involve regarding the P-value as a statement about the probability of a hypothesis being true. But P cannot be a statement about the probability of the truth or falsity of any hypothesis, because the calculation of P is based on the assumption that the null hypothesis is true. P is the probability of data (or of data more extreme) conditional on a hypothesis, not the probability of a hypothesis conditional on data. This may sound like statistical double-talk, but the difference is fundamental. The probability that I will encounter a certain species, given that it is rare in the study area, is quite different from the probability that the species is rare in the area, given that I have encountered it.

The common misinterpretation of P as the probability that the null hypothesis is true is appealing because it seems logical. Consider the following:

If the hypothesis is true, this observation cannot occur.
This observation has occurred.
Therefore, the hypothesis is false.

This is a valid syllogism in deductive logic called modus tollens, or denying the consequent. The logic of NHST is similar, but the statements are probabilistic:

If the null hypothesis is true, this observation is unlikely to occur.
This observation has occurred.
Therefore, the null hypothesis is likely to be false.

The structure of the NHST argument is the same, and the logic seems reasonable. But it is invalid. Why? Because we have moved from the black-and-white of deductive logic to the grays of probabilistic inference, and the rules are different. An example will show the fallacy:

If this person is a chemist, he/she is unlikely to win a Nobel Prize in chemistry.
This person has won a Nobel Prize in chemistry.
Therefore, this person is unlikely to be a chemist.

The first statement, while true, is about the proportion of chemists who win Nobel Prizes.
The validity of the conclusion, however, depends on the proportion of Nobel Prize winners in chemistry who are chemists, and neither the data (the second statement) nor the assumptions (the first statement) say anything about that. We are attempting to make a statement about the probability of truth of a statement (that the person is a chemist) using a framework of deductive logic when the situation calls for probabilistic reasoning. In the language of conditional probabilities, we want the probability of being a chemist conditional on winning a Nobel Prize in chemistry, not the probability of winning a Nobel Prize conditional on being a chemist.

The illogic of NHST has been pointed out many times before (e.g. Berkson 1942; Rozeboom 1960; Bakan 1966; Oakes 1986; Cohen 1994; Royall 1997; Goodman 1999; Trafimow 2003). Despite the faulty logic, the NHST pseudo-syllogism is alluring and continues as one of the 'fantasies of statistical significance' (Carver 1978). A further complication is that there are situations where the logic seems to work perfectly well. Substitute 'this is a fair coin' for 'this person is a chemist' and '10 "heads" in a row' for 'win a Nobel Prize' in the argument above, and everything can seem fine.

Aside from its logical problems, NHST has a subtly corrosive effect because it permits lazy analysis and impedes clear thinking. NHST has become so ingrained and automatic that it is carried out by researchers and is often required by journal editors with little thought about what the procedure means or whether it is necessary. 'Statistical "recipes" are followed blindly, and ritual has taken over from scientific thinking' (Preece 1984). 'The ritualization of NHST [has been carried] to the point of meaninglessness and beyond' (Cohen 1994). Nelder (1985) decried 'the grotesque emphasis on significance tests', and Salsburg (1985) satirized 'the religion of statistics'.
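Returning to the chemist example: a toy Bayes' theorem calculation, using numbers invented purely for illustration, shows how sharply the two conditional probabilities can differ.

```python
# Invented numbers for illustration only. Suppose 1 person in 10,000
# is a chemist, a chemist has a 1-in-100,000 chance of winning a Nobel
# Prize in chemistry, and a non-chemist a 1-in-10-billion chance.
p_chemist = 1e-4                  # Pr(chemist)
p_prize_given_chemist = 1e-5      # Pr(prize | chemist): small, like a P-value
p_prize_given_other = 1e-10       # Pr(prize | not a chemist)

# Bayes' theorem inverts the conditioning to give Pr(chemist | prize).
p_prize = (p_prize_given_chemist * p_chemist
           + p_prize_given_other * (1 - p_chemist))
p_chemist_given_prize = p_prize_given_chemist * p_chemist / p_prize

print(p_chemist_given_prize)  # ≈ 0.91, despite Pr(prize | chemist) = 0.00001
```

Even though the 'data' (a Nobel Prize) are highly improbable given the hypothesis (that the person is a chemist), the hypothesis is highly probable given the data. The two conditional probabilities answer different questions, and only the information missing from the syllogism (the relative rates among non-chemists) connects them.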
Statistical ritual leads to publication of papers that are methodologically impeccable but contain little actual information (Guthery 2008).

For example, one result of rote use of NHST is a confusion of statistical significance and biological importance (Boring 1919; Jones & Matloff 1986). Many ecologists are familiar with the idea that important biological effects may exist but not be statistically significant in a particular study because of small sample size. Calculations of statistical power can be helpful in this situation (Gerrodette 1987; Cohen 1988; Urquhart & Kincaid 1999; Gray & Burlew 2007), but power calculations, especially post hoc, are themselves confusing, misunderstood and misused (Goodman & Berlin 1994; Steidl et al. 1997). In particular, a power calculation based on the observed data provides no more information than the P-value itself (Thomas 1997; Hoenig & Heisey 2001).

Despite the general awareness of possible low power, a biological effect that is not statistically significant is often summarily dismissed as unimportant ('Growth rate was not significantly related to temperature') or nonexistent ('There was no difference in mean length among groups'). Effect size may not even be reported, leaving the reader uninformed about the estimated size of the biological effect, statistically significant or not. However, lack of statistical significance does not mean lack of biological importance.

The converse can also happen: that is, biologically unimportant effects can be statistically significant if sample size is large. When samples can be obtained relatively easily and cheaply (e.g. air, water or, increasingly, genetic samples), large sample size can lead to detection of trivial