Ecological Monographs, 54(2), 1984, pp. 187-211
© 1984 by the Ecological Society of America

PSEUDOREPLICATION AND THE DESIGN OF ECOLOGICAL FIELD EXPERIMENTS

STUART H. HURLBERT
Department of Biology, San Diego State University, San Diego, California 92182 USA

Abstract. Pseudoreplication is defined as the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent. In ANOVA terminology, it is the testing for treatment effects with an error term inappropriate to the hypothesis being considered. Scrutiny of 176 experimental studies published between 1960 and the present revealed that pseudoreplication occurred in 27% of them, or 48% of all such studies that applied inferential statistics. The incidence of pseudoreplication is especially high in studies of marine benthos and small mammals. The critical features of controlled experimentation are reviewed. Nondemonic intrusion is defined as the impingement of chance events on an experiment in progress. As a safeguard against both it and preexisting gradients, interspersion of treatments is argued to be an obligatory feature of good design. Especially in small experiments, adequate interspersion can sometimes be assured only by dispensing with strict randomization procedures. Comprehension of this conflict between interspersion and randomization is aided by distinguishing pre-layout (or conventional) and layout-specific alpha (probability of type I error). Suggestions are offered to statisticians and editors of ecological journals as to how ecologists' understanding of experimental design and statistics might be improved.

Key words: experimental design; chi-square; R. A. Fisher; W. S. Gossett; interspersion of treatments; nondemonic intrusion; randomization; replicability; type I error.

Manuscript received 25 February 1983; revised 21 June 1983; accepted 25 June 1983.

No one would now dream of testing the response to a treatment by comparing two plots, one treated and the other untreated.
-R. A. Fisher and J. Wishart (1930)

. . . field experiments in ecology [usually] either have no replication, or have so few replicates as to have very little sensitivity . . .
-L. L. Eberhardt (1978)

I don't know how anyone can advocate an unpopular cause unless one is either irritating or ineffective.
-Bertrand Russell (in Clark 1976:290)

INTRODUCTION

The following review is a critique of how ecologists are designing and analyzing their field experiments. It is also intended as an exploration of the fundamentals of experimental design. My approach will be: (1) to discuss some common ways in which experiments are misdesigned and statistics misapplied, (2) to cite a large number of studies exemplifying these problems, (3) to propose a few new terms for concepts now lacking convenient, specific labels, (4) to advocate treatment interspersion as an obligatory feature of good design, and (5) to suggest ways in which editors can quickly improve matters.

Most books on experimental design or statistics cover the fundamentals I am concerned with either not at all or only briefly, with few examples of misdesigned experiments, and few examples representing experimentation at the population, community, or ecosystem levels of organization.
The technical, mathematical, and mechanical aspects of the subject occupy the bulk of these books, which is proper, but which is also distracting to those seeking only the basic principles. I omit all mathematical discussions here.

The citing of particular studies is critical to the hoped-for effectiveness of this essay. To forego mention of specific negative examples would be to forego a powerful pedagogic technique. Past reviews have been too polite and even apologetic, as the following quotations illustrate:

There is much room for improvement in field experimentation. Rather than criticize particular instances, I will outline my views on the proper methods . . . . (Connell 1974)

In this review, the writer has generally refrained from criticizing the designs, or lack thereof, of the studies cited and the consequent statistical weakness of their conclusions; it is enough to say that the majority of the studies are defective in these respects. (Hurlbert 1975)

. . . as I write my comments, I seem to produce only a carping at details that is bound to have the total effect of an ill-tempered scolding . . . . I hope those whose work I have referenced as examples will
forgive me. I sincerely admire the quality of these papers . . . . (Hayne 1978)

Among the 151 papers investigated, a number of common problems were encountered . . . . It would be a profitless, and probably alienating, chore to discuss these with respect to individual papers. (Underwood 1981)

But while I here offer neither anonymity nor blanket admiration, let me state an obvious fact: the quality of an investigation depends on more than good experimental design, so good experimental design by itself is no guarantee of the value of a study. This review does not evaluate the overall quality of any of the works discussed. Most of them, despite errors of design or statistics, nevertheless contain useful information.

On the other hand, when reviewers have tried to emphasize the positive by pointing to particular field studies as being exemplary, their choices sometimes have seemed inappropriate. For example, Connell (1974) cites Boaden (1962) as being "one of the best examples of a controlled field experiment," and Chew (1978) cites Spitz (1968) as "the best example I have of the responses of plants to grazing by small mammals." Yet neither of the cited studies replicated their treatments, and both are therefore uncontrolled for the stochastic factor. Spitz (1968), moreover, misapplies statistics, treating replicate samples as if they represented replicate experimental units.

The new terms offered have been carefully chosen. Perhaps mathematical statisticians will find them inelegant, but I feel they will be helpful at least to ecologists and perhaps to other persons concerned with experimental design. Statistics and experimental design are disciplines with an impoverished vocabulary. Most of this essay concerns what a statistician might term "randomization," "replication," "independence," or "error term" problems, but these concepts can apply in many ways in an experiment, and they apply in different ways to different kinds of experiments. For example, one often can replicate at several levels (e.g., blocks, experimental units, samples, subsamples, etc.) in the design of an experiment; at many levels the replication may be superfluous or optional, but there is usually at least one level (experimental unit) at which replication is obligatory, at least if significance tests are to be employed. Likewise, the term "error" is used as shorthand for many different quantities or concepts, including: type I and type II errors, random and systematic errors introduced by the experimenter, variation among replicates, variation among samples, the discrepancy between x̄ and μ, and so on. A slightly enlarged vocabulary, particularly one providing labels for various types of invalid procedures, may make things easier for us.

I begin this discussion at an elementary level, presuming that the reader has had the equivalent of a one-semester course in statistics but no training in experimental design. This approach, and indeed, the whole essay, will seem too elementary to some ecologists. But I wish my premises and arguments to be explicit, clear, and easily attacked if in error. Also, it is the elementary principles of experimental design, not advanced or esoteric ones, which are most frequently and severely violated by ecologists.
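The importance of identifying the level at which replication is obligatory can be made concrete. The following minimal sketch (hypothetical data; Python with the scipy library assumed here and in later sketches) collapses the samples taken within each experimental unit to a single unit mean, so that the significance test uses an error term based on variation among experimental units rather than among samples:

```python
# Minimal sketch (hypothetical data): replication exists at two levels here,
# experimental units (ponds) and samples within each unit. Only the unit
# level supplies a valid error term for a treatment test, so samples are
# first collapsed to one mean per unit.
from scipy import stats

# Four treated and four control ponds, three plankton samples per pond.
treated = [[12.1, 11.8, 12.6], [10.4, 10.9, 10.2],
           [13.0, 12.5, 12.8], [11.1, 11.6, 11.3]]
control = [[9.8, 10.1, 9.5], [8.9, 9.3, 9.0],
           [10.6, 10.2, 10.9], [9.1, 9.6, 9.4]]

unit_means_t = [sum(pond) / len(pond) for pond in treated]  # n = 4, not 12
unit_means_c = [sum(pond) / len(pond) for pond in control]

# The test is run on unit means; testing all 24 samples directly would be
# pseudoreplication, because samples within a pond are not independent.
t, p = stats.ttest_ind(unit_means_t, unit_means_c)
print(f"t = {t:.2f}, P = {p:.4f}")
```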
THE EXPERIMENTAL APPROACH

There are five components to an experiment: hypothesis, experimental design, experimental execution, statistical analysis, and interpretation. Clearly the hypothesis is of primary importance, for if it is not, by some criterion, "good," even a well-conducted experiment will be of little value.

By experimental design is meant only "the logical structure of the experiment" (Fisher 1971:2). A full description of the objectives of an experiment should specify the nature of the experimental units to be employed, the number and kinds of treatments (including "control" treatments) to be imposed, and the properties or responses (of the experimental units) that will be measured. Once these have been decided upon, the design of an experiment specifies the manner in which treatments are assigned to the available experimental units, the number of experimental units (replicates) receiving each treatment, the physical arrangement of the experimental units, and often, the temporal sequence in which treatments are applied to and measurements made on the different experimental units.
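One of the design decisions just specified, the assignment of treatments to experimental units, can be sketched minimally as follows (hypothetical plot labels; the shuffle stands in for any formal randomization device):

```python
# Minimal sketch (hypothetical units and treatments): randomized assignment
# of two treatments to eight experimental units, one of the decisions an
# experimental design must specify.
import random

units = [f"plot-{i}" for i in range(1, 9)]      # eight available experimental units
treatments = ["treated"] * 4 + ["control"] * 4  # four replicates per treatment

random.shuffle(treatments)  # each unit equally likely to receive either treatment
assignment = dict(zip(units, treatments))
for unit, trt in sorted(assignment.items()):
    print(unit, "->", trt)

# Note: complete randomization guarantees nothing about the spatial
# interspersion of the two treatments; that conflict is taken up later
# in this article.
```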
The execution of an experiment includes all those procedures and operations by which a decided-upon design is actually implemented. Successful execution depends on the experimenter's artistry, insight, and good judgment as much as it does on his technical skill. While the immediate goal is simply the conduct of the technical operations of the experiment, successful execution requires that the experimenter avoid introducing systematic error (bias) and minimize random error. If the effects of DDT are being examined, the DDT must not be contaminated with parathion. If the effects of an intertidal predator are being assessed by the use of exclusion cages, the cages must have no direct effect on variables in the system other than the predator. If the effects of nutrients on pond plankton are being studied, the plankton must be sampled with a device whose efficiency is independent of plankton abundance. Systematic error either in the imposition of treatments or in sampling or measurement procedures renders an experiment invalid or inconclusive.

Decisions as to what degree of initial heterogeneity among experimental units is permissible or desirable, and about the extent to which one should attempt to regulate environmental conditions during the experiment, are also a matter of subjective judgment. These decisions will affect the magnitude of random error and therefore the sensitivity of an experiment. They also will influence the specific interpretation of the results, but they cannot by themselves affect the formal validity of the experiment.

From the foregoing, it is clear that experimental design and experimental execution bear equal responsibility for the validity and sensitivity of an experiment. Yet in a practical sense, execution is a more critical aspect of experimentation than is design. Errors in experimental execution can and usually do intrude at more points in an experiment, come in a greater number of forms, and are often subtler than design errors. Consequently, execution errors generally are more difficult to detect than design errors, both for the experimenter himself and for readers of his reports. It is the insidious effects of such undetected or undetectable errors that make experimental execution so critical. Despite their pre-eminence as a source of problems, execution errors are not considered further here.

In experimental work, the primary function of statistics is to increase the clarity, conciseness, and objectivity with which results are presented and interpreted. Statistical analysis and interpretation are the least critical aspects of experimentation, in that if purely statistical or interpretative errors are made, the data can be reanalyzed. On the other hand, the only complete remedy for design or execution errors is repetition of the experiment.

MENSURATIVE EXPERIMENTS

Two classes of experiments may be distinguished: mensurative and manipulative. Mensurative experiments involve only the making of measurements at one or more points in space or time; space or time is the only "experimental" variable or "treatment." Tests of significance may or may not be called for. Mensurative experiments usually do not involve the imposition by the experimenter of some external factor(s) on experimental units. If they do involve such an imposition (e.g., comparison of the responses of high-elevation vs. low-elevation oak trees to experimental defoliation), all experimental units are "treated" identically.

Example 1. We wish to determine how quickly maple (Acer) leaves decompose when on a lake bottom in 1 m of water. So we make eight small bags of nylon netting, fill each with maple leaves, and place them in a group at a spot on the 1-m isobath. After 1 mo we retrieve the bags, determine the amount of organic matter lost ("decomposed") from each, and calculate a mean decomposition rate. This procedure is satisfactory as far as it goes. However, it yields no information on how the rate might vary from one point to another along the 1-m isobath; the mean rate we have calculated from our eight leaf bags is a tenuous basis for making generalizations about "the decomposition rate on the 1-m isobath of the lake."
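A minimal numerical sketch of Example 1 (hypothetical organic-matter losses) makes the limitation explicit: the standard error computed from the eight bags describes bag-to-bag variation at one spot only, not variation along the isobath:

```python
# Minimal sketch of Example 1 (hypothetical data): mean decomposition rate
# from eight leaf bags placed together at one spot on the 1-m isobath.
import statistics

loss_pct = [31.5, 28.9, 33.2, 30.1, 29.7, 32.4, 30.8, 31.0]  # % organic matter lost in 1 mo

mean_rate = statistics.mean(loss_pct)
se = statistics.stdev(loss_pct) / len(loss_pct) ** 0.5

# This standard error measures bag-to-bag variation at a single spot only;
# it says nothing about point-to-point variation along the isobath.
print(f"mean loss = {mean_rate:.1f}%, SE = {se:.2f}")
```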
Such a procedure is usually termed an experiment simply because the measurement procedure is somewhat elaborate, often involving intervention in or prodding of the system. If we had taken eight temperature measurements or eight dredge samples for invertebrates, few persons would consider those procedures and their results to be "experimental" in any way. Efforts at semantic reform would be in vain. Historically, "experimental" has always had "difficult," "elaborate," and "interventionist" among its common meanings, and inevitably will continue to do so. The term mensurative experiment may help us keep in mind the distinction between this approach and that of the manipulative experiment. As the distinction is basically that between sampling and experimentation sensu stricto, advice on the "design" of mensurative experiments is to be found principally in books such as Sampling techniques (Cochran 1963) or Sampling methods for censuses and surveys (Yates 1960), and not in books with the word "design" in the title.

Comparative mensurative experiments

Example 2. We wish, using the basic procedure of Example 1, to test whether the decomposition rate of maple leaves differs between the 1-m and the 10-m isobaths. So we set eight leaf bags on the 1-m isobath and another eight bags on the 10-m isobath, wait a month, retrieve them, and obtain our data. Then we apply a statistical test (e.g., t test or U test) to see whether there is a significant difference between decomposition rates at the two locations.

We can call this a comparative mensurative experiment. Though we use two isobaths (or "treatments") and a significance test, we still have not performed a true or manipulative experiment. We are simply measuring a property of the system at two points within it and asking whether there is a real difference ("treatment effect") between them.

To achieve our vaguely worded purpose in Example 1, perhaps any sort of distribution of the eight bags on the 1-m isobath was sufficient. In Example 2, however, we have indicated our goal to be a comparison of the two isobaths with respect to decomposition rate of maple leaves. Thus we cannot place our bags at a single location on each isobath. That would not give us any information on variability in decomposition rate from one point to another along each isobath. We require such information before we can validly apply inferential statistics to test our null hypothesis that the rate will be the same on the two isobaths. So on each isobath we must disperse our leaf bags in some suitable fashion. There are many ways we could do this. Locations along each isobath ideally should be picked at random, but bags could be placed individually (eight locations), in groups of two each (four locations), or in groups of four each (two locations). Furthermore, we might decide that it was sufficient to work only with the isobaths along one side of the lake, etc.

Assuring that the replicate samples or measurements are dispersed in space (or time) in a manner appropriate to the specific hypothesis being tested is the most critical aspect of the design of a mensurative experiment.
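A minimal analysis sketch for Example 2 (hypothetical losses, with one bag at each of eight dispersed locations per isobath) might apply both of the tests just named:

```python
# Minimal sketch of Example 2 (hypothetical data): one leaf bag at each of
# eight dispersed locations per isobath, so locations are the replicates.
from scipy import stats

loss_1m  = [31.5, 24.9, 33.2, 27.1, 29.7, 35.4, 26.8, 31.0]   # % lost, 1-m isobath
loss_10m = [22.3, 19.8, 26.1, 17.9, 24.4, 20.6, 23.2, 18.7]   # % lost, 10-m isobath

t, p_t = stats.ttest_ind(loss_1m, loss_10m)        # t test
u, p_u = stats.mannwhitneyu(loss_1m, loss_10m)     # U test (Mann-Whitney)

print(f"t test: t = {t:.2f}, P = {p_t:.4f}")
print(f"U test: U = {u:.1f}, P = {p_u:.4f}")
```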
Pseudoreplication in mensurative experiments

Example 3. Out of laziness, we place all eight bags at a single spot on each isobath. It will still be legitimate to apply a significance test to the resultant data. However, and the point is the central one of this essay, if a significant difference is detected, this constitutes evidence only for a difference between two (point) locations; one "happens to be" a spot on the 1-m isobath, and the second "happens to be" a spot on the 10-m isobath. Such a significant difference cannot legitimately be interpreted as demonstrating a difference between the two isobaths, i.e., as evidence of a "treatment effect." For all we know, such an observed significant difference is no greater than we would have found if the two sets of eight bags had been placed at two locations on the same isobath.

If we insist on interpreting a significant difference in Example 3 as a "treatment effect" or real difference between isobaths, then we are committing what I term pseudoreplication. Pseudoreplication may be defined, in analysis of variance terminology, as the testing for treatment effects with an error term inappropriate to the hypothesis being considered. In Example 3 an error term based on eight bags at one location was inappropriate. In mensurative experiments generally, pseudoreplication is often a consequence of the actual physical space over which samples are taken or measurements made being smaller or more restricted than the inference space implicit in the hypothesis being tested. In manipulative experiments, pseudoreplication most commonly results from use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent. Pseudoreplication thus refers not to a problem in experimental design (or sampling) per se but rather to a particular combination of experimental design (or sampling) and statistical analysis which is inappropriate for testing the hypothesis of interest.

The phenomenon of pseudoreplication is widespread in the literature on both mensurative and manipulative experiments. It can appear in many guises. The remainder of this article deals with pseudoreplication in manipulative experiments and related matters.
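The cost of the design in Example 3 can be illustrated by simulation. In the hypothetical sketch below, the two isobaths have identical true mean decomposition rates but spots along an isobath differ; with all eight bags clustered at a single spot per isobath, the t test rejects the true null hypothesis far more often than the nominal alpha of 0.05:

```python
# Hypothetical simulation of Example 3: no true isobath effect, but spots
# along an isobath differ. Clustering all bags at one spot per isobath makes
# the t test reject a true null far more often than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_bags = 10_000, 8
sd_spot, sd_bag = 4.0, 2.0   # spot-to-spot and bag-to-bag standard deviations

rejections = 0
for _ in range(n_trials):
    # One spot per isobath; all eight bags share that spot's deviation.
    spot_a, spot_b = rng.normal(0.0, sd_spot, size=2)
    bags_a = 30.0 + spot_a + rng.normal(0.0, sd_bag, n_bags)
    bags_b = 30.0 + spot_b + rng.normal(0.0, sd_bag, n_bags)
    if stats.ttest_ind(bags_a, bags_b).pvalue < 0.05:
        rejections += 1

print(f"realized type I error rate: {rejections / n_trials:.2f}")  # far above 0.05
```

The error term based on bags within a spot reflects only the bag-to-bag variation, while the difference between the two group means reflects the much larger spot-to-spot variation; the mismatch is exactly what the definition of pseudoreplication describes.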
MANIPULATIVE EXPERIMENTS

More on terminology

Whereas a mensurative experiment may consist of a single treatment (Example 1), a manipulative experiment always involves two or more treatments, and has as its goal the making of one or more comparisons. The defining feature of a manipulative experiment is that the different experimental units receive different treatments and that the assignment of treatments to experimental units is or can be randomized. Note that in Example 2 the experimental units are not the bags of leaves, which are more accurately regarded only as measuring instruments, but rather the eight physical locations where the bags are placed.

Following Anscombe (1948), many statisticians use the term comparative experiment for what I am calling manipulative experiment and absolute experiment for what I am calling mensurative experiment. I feel Anscombe's terminology is misleading. It obscures the fact that comparisons also are the goal of many mensurative experiments (e.g., Example 2).

Cox (1958:92-93) draws a distinction between treatment factors and classification factors that at first glance seems to parallel the distinction between mensurative and manipulative experiments. However, it does not. For Cox, "species" would always be a classification factor, because "species is an intrinsic property of the unit and not something assigned to it by the experimenter." Yet "species," like many other types of classification factors, clearly can be the treatment variable in either a mensurative or a manipulative experiment. Testing the effects of a fire retardant on two types of wood (Cox's example 6.3, simplified) or comparing decomposition rates of oak and maple leaves (my Example 5) represent manipulative experiments, with species being the treatment variable, and with randomized assignment of treatments to experimental units (=physical locations) being possible. However, to measure and compare the photosynthetic rates of naturally established oak and maple trees in a forest would be to conduct a mensurative experiment. Randomized assignment of the two tree species to locations would not be possible.

Cox's (1958) distinction of treatment factors vs. classification factors is a valid one. But because it does not coincide with any dichotomy in experimental design or statistical procedures, it is less critical than the mensurative-manipulative classification proposed here.

Critical features of a controlled experiment

Manipulative experimentation is subject to several classes of potential problems. In Table 1 I have listed these as "sources of confusion"; an experiment is successful to the extent that these factors are prevented from rendering its results inconclusive or ambiguous. It is the task of experimental design to reduce or eliminate the influence of those sources numbered 1 through 6. For each potential source there are listed the one or more features of experimental design that will accomplish this reduction. Most of these features are obligatory. Refinements in the execution of an experiment may further reduce these sources of confusion. However, such refinements cannot substitute for the critical features of experimental design: controls, replication, randomization, and interspersion.
Table 1. Potential sources of confusion in an experiment and means for minimizing their effect. Each source of confusion is followed by the features of an experimental design that reduce or eliminate it.

1. Temporal change: Control treatments.
2. Procedure effects: Control treatments.
3. Experimenter bias: Randomized assignment of experimental units to treatments; randomization in conduct of other procedures; "blind" procedures.*
4. Experimenter-generated variability (random error): Replication of treatments.
5. Initial or inherent variability among experimental units: Replication of treatments; interspersion of treatments; concomitant observations.
6. Nondemonic intrusion:† Replication of treatments; interspersion of treatments.
7. Demonic intrusion: Eternal vigilance, exorcism, human sacrifices, etc.

* Usually employed only where measurement involves a large subjective element.
† Nondemonic intrusion is defined as the impingement of chance events on an experiment in progress.

One can always assume that certain sources of confusion are not operative and simplify experimental design and procedures accordingly. This saves much work. However, the essence of a controlled experiment is that the validity of its conclusions is not contingent on the concordance of such assumptions with reality. Against the last source of confusion listed (Table 1), experimental design can offer no defense. The meaning of demonic and nondemonic intrusion will be clarified shortly.

Controls. - "Control" is another of those unfortunate terms having several meanings even within the context of experimental design. In Table 1, I use control in the most conventional sense, i.e., any treatment against which one or more other treatments is to be compared. It may be an "untreated" treatment (no imposition of an experimental variable), a "procedural" treatment (as when mice injected with saline solution are used as controls for mice injected with saline solution plus a drug), or simply a different treatment.

At least in experimentation with biological systems, controls are required primarily because biological systems exhibit temporal change. If we could be absolutely certain that a given system would be constant in its properties, over time, in the absence of an experimentally imposed treatment, then a separate control treatment would be unnecessary. Measurements on an experimental unit prior to treatment could serve as controls for measurements on the experimental unit following treatment.

In many kinds of experiments, control treatments have a second function: to allow separation of the effects of different aspects of the experimental procedure. Thus, in the mouse example above, the "saline solution only" treatment would seem to be an obligatory control. Additional controls, such as "needle insertion only" and "no treatment," may be useful in some circumstances.

A broader and perhaps more useful (though less conventional) definition of "control" would include all the obligatory design features listed beside "Sources of confusion" numbers 1-6 (Table 1).
"Controls" (sensu stricto) control for temporal change and procedure effects. Randomization controls for (i.e., reduces or eliminates) potential experimenter bias in the assignment of experimental units to treatments and in the carrying out of other procedures. Replication controls for the stochastic factor, i.e., among-replicates variability inherent in the experimental material or introduced by the experimenter or arising from nondemonic intrusion. Interspersion controls for regular spatial variation in properties of the experimental units, whether this represents an initial condition or a consequence of nondemonic intrusion.

In this context it seems perfectly accurate to state that, for example, an experiment lacking replication is also an uncontrolled experiment: it is not controlled for the stochastic factor. The custom of referring to replication and control as separate aspects of experimental design is so well established, however, that "control" will be used hereafter only in this narrower, conventional sense.

A third meaning of control in experimental contexts is regulation of the conditions under which the experiment is conducted. It may refer to the homogeneity of experimental units, to the precision of particular treatment procedures, or, most often, to the regulation of the physical environment in which the experiment is conducted. Thus some investigators would speak of an experiment conducted with inbred white mice in the laboratory at 25 ± 1°C as being "better controlled" or "more highly controlled" than an experiment conducted with wild mice in a field where temperature fluctuated between 15° and 30°C. This is unfortunate usage, for the adequacy of the true controls (i.e., control treatments) in an experiment is independent of the degree to which the physical conditions are restricted or regulated. Nor is the validity of the experiment affected by such regulation. Nor are the results of statistical analysis modified by it; if there are no design or statistical errors, the confidence with which we can reject the null hypothesis is indicated by the value of P alone. These facts are little understood by many laboratory scientists.

This third meaning of control undoubtedly derives in part from misinterpretation of the ancient but ambiguous dictum, "Hold constant all variables except the one of interest." This refers not to temporal con-