Permutation tests for multi-facto...
able for multiple regression: (i) restricted permutations (i.e., allowing permutations to occur only within levels of other factors) and (ii) permutation of units other than the observations (e.g., permuting the units induced by a nested factor in the test of a higher ranked factor). The second strategy may be pertinent in designs with random factors. In many ANOVA designs, an exact test for an individual factor can be constructed using one or other or both of these additional strategies. It is virtually entirely unknown, however, just how such tests compare with the various approximate tests in terms of their power for different kinds of designs. Thus, it is unclear, for any particular term in a complex ANOVA design, which combination of (i) choice of permutable units, (ii) use of restricted permutation and=or (iii) permutation of raw data or residuals will provide the most powerful or most appropriate strategy. In order to construct an appropriate permutation distribution for any term in a complex ANOVA design, it is necessary to clarify several important issues, including: (a) Should raw data be permuted, or some form of residuals? (b) Which units should be permuted: indi- vidual observation units or some larger units, such as levels of a nested factor in a test of the higher ranked factor of a hierarchy? (c) When should permutations be restricted to occur within levels of other factors? (d) How may interaction terms be tested using permutations? (e) Which tests are exact and which are approximate? (f) What test-statistic should be used for the test? The purpose of this paper is to provide precisely such a clarification of these issues. First, we give a guideline for the construction of an exact permutation test, where possible, in any given situation for any factor in a complex ANOVA design. This guideline sheds light on how the design of the experiment or survey, including whether factors are fixed or random, nested or crossed with other factors, determines how valid permutation tests are to be done. No such general guideline, including the possibility for random effects, mixed models and nested hierarchies, currently exists in such a succinct form in the literature, to our knowledge. Second, we give results from specific sets of empirical Monte Carlo simulations done to compare the level accuracy and power of several possible permutation strategies (including the exact test, where possible) for individual terms in several models of two-way ANOVA. These were chosen so that, from them, important general statements and recommendations for multi-way ANOVA could be made. These are the most extensive simulations for permu- tation tests in complex ANOVA designs provided to date, to our knowledge, as well as being the only ones to include nested hierarchies and mixed or random effects models. Although we restrict our attention here to univariate ANOVA designs, the results will hold, in general, for the analogous permutation tests on multivariate response vectors. 2 CONSTRUCTING AN EXACT TEST We provide a guideline for constructing an exact permutation test for individual terms in a multi-factorial ANOVA. We first need to delimit the class of designs and introduce some definitions. We limit our discussion to equi-replicated orthogonal ANOVA designs, following Nelder (1964a 1964b). By orthogonal designs, we mean effects are independent, whether they be crossed or nested. In considering expected mean squares for terms in any ANOVA model (and for simulations) we also rely on the usual summation restrictions for fixed effects, i.e., that the sum individual treatment effects across all levels of a fixed factor equals zero. For alternative formulations of expected mean squares (i.e., without summation restrictions or for unbalanced designs) consult Searle et al. (1992). We consider that the same general guideline below should be followed for permutation tests for unbalanced designs, but do not provide any particular details of this here for particular cases of unbalance. MULTI-FACTORIAL ANALYSIS OF VARIANCE 87
The null hypothesis for a test of any particular factor, whose categorical levels we may generally call treatments, is that there is no treatment effect. More generally, the null hypo- thesis is that the errors associated with units in different treatments have the same distribution. In the case of more complex designs, the null hypothesis is conditional: given the other terms in the model (such as main effects in the test of an interaction), the errors associated with units in different treatments have the same distribution. We consider that exchangeability of units for permutation tests gains its validity by virtue of the statement of similar error distributions under the null hypothesis (e.g., Kempthorne, 1952). Although permutation tests avoid the assumption of normality, they still assume exchange- ability of relevant units under the null hypothesis. Exchangeability can be ensured through the random allocation of treatments to units in experimental design (e.g., Fisher, 1935 Kempthorne, 1955 Scheffe, �� 1959) or must be assumed for observational studies (e.g., Kempthorne, 1966). The assumption of exchangeability is tantamount to the assumption that errors are independent and identically distributed (������i.i.d.������). Note that this does not avoid the assumption of homogeneity of error variances (e.g., Boik, 1987 Hayes, 1996). Next, we provide definitions for the ������order������ of a term and what is meant by the ������exchangeable units������ for a test. By the ������order������ of a term, we mean the following: main effects are of first order, a two-way interaction term is of second order, etc. A term that is nested in another term has an order one more than the order of the term within which it is nested. ������Exchangeable units������ are identified in the denominator mean square of an F-ratio for any particular term. Consider Figure 1. Where the residual mean square is the mean square in the denominator for a test, this indicates that individual units (observations) are exchangeable under the null hypothesis (Fig. 1a). Where the denominator mean square of an F-ratio is the mean square of, for example, a nested term, then the exchangeable units con- sist of the units induced by the nested term. For example, in Figure lb there are eight ��2 4�� units induced by the nested term B, each unit consisting of two replicates. Where the denomi- nator of an F-ratio is the mean square of, for example, an interaction term, then the units con- sist of the blocks of cells identified by that interaction (e.g., if one factor has four levels and a second factor has three levels, then the interaction identifies 4 3 �� 12 cells which are the exchangeable units, Fig. 1c). A term whose mean-square appears as the denominator in the F-ratio of the test of another term in the model identifies the exchangeable units for that test. The exchangeable units have the same distribution under the null hypothesis. We provide the following general guideline: An exact permutation test for any term in an ANOVA model is achieved by permuting the exchangeable units identified by the denominator mean-square of the F-ratio and restricting permutations to occur within the levels of terms of either smaller order or of the same order as the term being tested. The guideline yields a permutation test based on restricted permutations of the exchange- able units. The guideline ensures that, under the null hypothesis, the likelihood of the data is invariant under these permutations, and consequently the test is exact. It encapsulates the idea that all unknown parameters in the model not being tested should be kept constant under per- mutation in order to isolate the test on the factor of interest. In the context of equi-replicated orthogonal ANOVA designs we identify two different aspects to guarantee the invariance: (i) components of variation that may contribute to variability in the term being tested, and (ii) other non-zero terms in the linear ANOVA model. The first of these is addressed by con- sidering the appropriate exchangeable units for the permutation test. The second is addressed by restricting the permutations of those units within levels of other factors. The construction of the F-ratio itself provides the necessary information concerning the exchangeable units by identifying the components of variation that contribute to variability in the term being tested. The ratio is constructed by reference to the expected mean squares 88 M. J. ANDERSON AND C. J. F. TER BRAAK
of individual terms. If components of variation that are not being tested are present in the numerator���s expected mean square, they must appear in the expected mean square of the denominator in order to isolate the component of variation of interest for the test using the F-ratio. The denominator mean square thus identifies the components of variation that contribute to the variability in the term being tested and, by this, the exchangeable units to be used in the permutation test. Some terms in the model may not contribute to variability in the term of interest but, if non-zero, would be confounded with (������mixed with������) the term under test. Restricted permu- tation within levels of these factors avoids this by ensuring that the sizes of their effects in the model remain constant under permutation. For example, in a two-way ANOVA without FIGURE 1 Diagram of permutation strategies for exact tests: a) test of A in a one-way ANOVA model b) test of factor A, the higher-ranked factor in a two-way nested design, with random factor B nested within factor A and c) test of factor A, a fixed factor, in a two-way crossed mixed model design (where factor B is random). MULTI-FACTORIAL ANALYSIS OF VARIANCE 89
interaction, permutations for a test of one factor should be restricted to occur within levels of the other factor for an exact test. For a discussion of restricted permutation for many kinds of complex ANOVA designs in psychology, see Edgington (1995). For a discussion of restricted permutation in the context of multiple regression, where predictor variables take several fixed repeated values, see Brown and Maritz (1982). The exact permutation test will be unique for any particular term in an ANOVA model. An exact permutation test, however, may not always exist, or may not have enough possible per- mutations to enable reasonable power for detecting alternatives. This leads us to consider approximate tests. 3 APPROXIMATE TESTS In situations where the strategy required for an exact test is not possible (i.e., where the restrictions required leave no possible permutations), then no exact test exists, but an appro- ximate permutation test may be used. This occurs, for example, in the case of interaction terms. Except in some special cases (Welch, 1990), it is not possible to construct an exact permutation test for an interaction using the F-statistic, as restricting permutations within levels of the main effects leaves no alternative possible permutations: the F-ratio obtained with the observed data is the only possibility. (For alternative approaches using other test statistics, see Pesarin, 2001). Another situation where an exact test is not feasible occurs when the combination of restrictions and exchangeable units results in there being too few possible permutations to obtain a reasonable test. For example, if there are only two levels of a nested factor (B) in each of two levels of a higher ranked (but lower order) factor (A), this leaves only 6 possible permutations for the test of A, 3 of which would give unique values for the test statistic and one of which is the observed value. This is clearly not sufficient to provide a reasonable test. For complex designs, there are essentially three different approaches for an approximate test: permutation of residuals under a reduced model (Still and White, 1981 Freedman and Lane, 1983), under the full model (ter Braak, 1992) or unrestricted permutation of raw data (Manly, 1997 Gonzalez and Manly, 1998). For a comparison of these approaches in the more general context of multiple regression, see Anderson and Legendre (1999) and Anderson and Robinson (2001). Permutation under the reduced model comes the closest to a conceptually exact test (Anderson and Robinson, 2001). For this reason, we here generally restrict our attention concerning permutation of residuals to reduced-model residuals, although full-model residuals can also be used in these situations, and would be expected to give highly comparable results (Anderson and Legendre, 1999). We also add that the problems raised by Kennedy and Cade (1996) for the method of permutation of raw data, which concerned the effects of outliers in nuisance variables for multiple regression, cannot occur in the context of ANOVA, where factors are categorical and thus do not contain outliers (Anderson and Legendre, 1999). In what follows, we describe and compare the possible permutation strategies (appro- ximate and exact, where possible) available for tests of individual terms in two-way ANOVA for nested, crossed fixed and crossed mixed models. These demonstrate the princi- ples involved in constructing an exact test, which we then extend to three-way models. In particular, we give results of empirical simulations that compare the various possible permutation strategies, demonstrating which of them provides the greatest empirical power for tests. 90 M. J. ANDERSON AND C. J. F. TER BRAAK
4 NESTED DESIGN Consider a nested (hierarchical) ANOVA design with the following linear model: yijk �� m �� Ai �� B��A��j��i�� �� eijk where m is the unknown population mean, B��A��j��i�� is the effect of the jth level of factor B within the ith treatment level of factor A, symbolised by Ai, and eijk is the unknown error associated with observation yijk . The number of levels in factors A and B will be designated by a and b, respectively, and the number of replications per AB combination by n. We con- sider A as a fixed factor, but for this model the same discussion will apply whether A is fixed or random. Factor B, being nested, will, for present purposes, always be considered random. Note that although factor B has b levels, it introduces a total of a b effects, denoted by B��A��j��i��, into the model. For the permutation test, we assume only that the e���s are independent and identically distributed (i.i.d.), but not that they are (necessarily) normal. Throughout, we shall denote sums of squares, mean squares and F-ratios for a particular term (say for factor B) as SSB MSB and FB, respectively. Also, the subscript ������R������ shall indicate the resi- dual, while the subscript ������T������shall indicate the total. For the above model, a test of factor B is provided by the statistic FB �� MSB=MSR and a test of factor A is provided by FA �� MSA=MSB (e.g. Kempthorne, 1952 Scheffe, �� 1959 Winer et al., 1991). 4.1 Test of the Nested Factor, B The null hypothesis for an exact test of the effect of factor B can be phrased: given the pre- sence of A, about which we make no assumption, the effect of B within A is not different from zero. Thus, the permutations of observations (y) are done across the levels of B, but these are restricted to occur within each of the levels of A. This means that whether or not factor A has an effect, we can test for significant variability due to factor B. This strategy is consistent with the guideline. This provides an exact test because (i) SST stays constant across all permutations, (ii) restricting permutations within levels of A means that SSA remains constant across all permutations, and (iii) SST �� SSR �� SSB �� SSA. Thus, permutation mixes (exchanges) variation only between SSB and SSR, which is exact for the test of FB �� MSB=MSR. For an approximate test, one can calculate and permute the reduced-model residuals rijk �� yijk yi::, y where yi:: y is the mean of the observations in level i of factor A. This amounts to estimating and ������removing������ the effect of A by subtracting the mean of the appropriate level of A from each observed value. Permuting these residuals will yield an approximate test. An alternative approximate test of factor B is provided by simply permuting all observa- tions without restriction. This mixes variability among all terms in the model, including variability due to factor A (i.e., SSA), which is not held constant under permutation. However, as this ������mixing������ of SSA under permutation impinges on each of SSB and SSR, the ratio FB �� MSB=MSR, on average, is unaffected, giving a reasonable approximate test (Gonzalez and Manly, 1998 Anderson and Robinson, 2001). Simulations were done to demonstrate the type I error and relative power of the above pro- posed strategies of permutation to detect variability due to factor B. Data were simulated according to the above model with a �� 4 b �� 5 and the sample size within each cell (n) was set at n �� f2 5 or 10g, where the effect of factor A was non-zero: Ai �� f33:3 33:3 33:3 100g. The additive effect of levels of the random factor B were chosen randomly from a normal distribution with mean zero and gradual increases in standard deviation. Note that b �� 5 different values of B��A��j��i�� were chosen within each level of factor A. MULTI-FACTORIAL ANALYSIS OF VARIANCE 91