Ecology, 81(11), 2000, pp. 3178–3192
© 2000 by the Ecological Society of America

CLASSIFICATION AND REGRESSION TREES: A POWERFUL YET SIMPLE TECHNIQUE FOR ECOLOGICAL DATA ANALYSIS

GLENN DE'ATH¹ AND KATHARINA E. FABRICIUS²

¹Tropical Environment Studies and Geography, James Cook University, Townsville, Qld 4811, Australia
²Australian Institute of Marine Science, P.M.B. 3, Townsville, Qld 4810, Australia

Abstract. Classification and regression trees are ideally suited for the analysis of complex ecological data. For such data, we require flexible and robust analytical methods, which can deal with nonlinear relationships, high-order interactions, and missing values. Despite such difficulties, the methods should be simple to understand and give easily interpretable results. Trees explain variation of a single response variable by repeatedly splitting the data into more homogeneous groups, using combinations of explanatory variables that may be categorical and/or numeric. Each group is characterized by a typical value of the response variable, the number of observations in the group, and the values of the explanatory variables that define it. The tree is represented graphically, and this aids exploration and understanding. Trees can be used for interactive exploration and for description and prediction of patterns and processes. Advantages of trees include: (1) the flexibility to handle a broad range of response types, including numeric, categorical, ratings, and survival data; (2) invariance to monotonic transformations of the explanatory variables; (3) ease and robustness of construction; (4) ease of interpretation; and (5) the ability to handle missing values in both response and explanatory variables. Thus, trees complement or represent an alternative to many traditional statistical techniques, including multiple regression, analysis of variance, logistic regression, log-linear models, linear discriminant analysis, and survival models.
We use classification and regression trees to analyze survey data from the Australian central Great Barrier Reef, comprising abundances of soft coral taxa (Cnidaria: Octocorallia) and physical and spatial environmental information. Regression tree analyses showed that dense aggregations, typically formed by three taxa, were restricted to distinct habitat types, each of which was defined by combinations of 3–4 environmental variables. The habitat definitions were consistent with known experimental findings on the nutrition of these taxa. When used separately, physical and spatial variables were similarly strong predictors of abundances and lost little in comparison with their joint use. The spatial variables are thus effective surrogates for the physical variables in this extensive reef complex, where information on the physical environment is often not available. Finally, we compare the use of regression trees and linear models for the analysis of these data and show how linear models fail to find patterns uncovered by the trees.

Key words: analysis of variance; CART; classification tree; coral reef; Great Barrier Reef; habitat characteristic; Octocorallia; regression tree; soft coral; surrogate.

Manuscript received 16 November 1998; revised 4 November 1999; accepted 6 November 1999.

INTRODUCTION

Ecological data are often complex, unbalanced, and contain missing values. Relationships between variables may be strongly nonlinear and involve high-order interactions. The commonly used exploratory and statistical modeling techniques often fail to find meaningful ecological patterns from such data. Classification and regression trees (Breiman et al. 1984, Clark and Pregibon 1992, Ripley 1996) are modern statistical techniques ideally suited for both exploring and modeling such data, but have seldom been used in ecology (Staub et al. 1992, Baker 1993, Rejwan et al. 1999). Trees explain variation of a single response variable by one or more explanatory variables.
The response variable is usually either categorical (classification trees) or numeric (regression trees), and the explanatory variables can be categorical and/or numeric. The tree is constructed by repeatedly splitting the data, with each split defined by a simple rule based on a single explanatory variable. At each split the data are partitioned into two mutually exclusive groups, each of which is as homogeneous as possible. The splitting procedure is then applied to each group separately. The objective is to partition the response into homogeneous groups, but also to keep the tree reasonably small. The size of a tree equals the number of final groups. Splitting is continued until an overlarge tree is grown, which is then pruned back to the desired size. Each group is typically characterized by either the distribution (categorical response) or mean value (numeric response) of the response variable, group size, and the values of the explanatory variables that define it.

The way that explanatory variables are used to form splits depends on their type. For a categorical explanatory variable with two levels, only one split is possible, with each level defining a group. For categorical variables with >2 levels, any combination of levels can be used to form a split, and for $k$ levels, there are $2^{k-1} - 1$ possible splits. For numeric explanatory variables, a split is defined by values less than, and greater than, some chosen value. Thus, only the rank order of numeric variables determines a split, and for $u$ unique values there are $u - 1$ possible splits. From all possible splits of all explanatory variables, we select the one that maximizes the homogeneity of the two resulting groups. Homogeneity can be defined in many ways, with the choice depending on the type of response variable.

Trees are represented graphically, with the root node, which represents the undivided data, at the top, and the branches and leaves (each leaf represents one of the final groups) beneath. Additional information can be displayed on the tree, e.g., summary statistics of nodes, or distributional plots.

We will show how trees can deal with complex ecological data sets using soft coral (Cnidaria: Octocorallia) survey data from the Australian central Great Barrier Reef. This, together with a detailed exposition of trees, follows this introduction. First, we describe the soft coral survey data, and ecological issues that we investigate with trees. We then illustrate the basics of classification and regression trees with two analyses of a soft coral species.
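
The split counts given earlier in this introduction can be checked with a short sketch. The function names are hypothetical, and placing numeric thresholds at midpoints between consecutive unique values is an assumption made for illustration; the text only requires "some chosen value" between them.

```python
from itertools import combinations


def categorical_splits(levels):
    """Enumerate every two-group split of a set of category levels."""
    levels = sorted(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level in the left group so mirror-image splits
    # are not counted twice; r < len(rest) keeps the right group non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(levels) - left
            splits.append((left, right))
    return splits


def numeric_splits(values):
    """Candidate thresholds between consecutive unique values (midpoints assumed)."""
    unique = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(unique, unique[1:])]


# Four within-reef locations give 2**(4 - 1) - 1 = 7 possible splits.
locations = ["front", "back", "channel", "flank"]
print(len(categorical_splits(locations)))

# Four unique values give 4 - 1 = 3 possible splits.
print(len(numeric_splits([1, 1, 2, 5, 5, 9])))
```

Because only the rank order of a numeric variable matters, any threshold between two consecutive unique values produces the same partition of the data.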
A more detailed discussion of trees follows, and includes: (1) exploration, description, and prediction of data; (2) technical aspects of growing trees with different splitting criteria; (3) pruning trees to size by cross-validation; and (4) data transformations, missing values, and tree diagnostics. Tree analyses of the soft coral data then address the following ecological issues: (1) relationships between physical and spatial environmental variables, (2) habitat characteristics associated with aggregations of three soft coral taxa, and (3) comparison of physical and spatial variables as predictors of soft coral abundance. Finally, we compare the performance of trees to equivalent linear model analyses.

THE SOFT CORAL STUDY

Soft corals (class Anthozoa, Octocorallia: Order Alcyonacea) occur in high abundances on many types of coral reefs. They can numerically dominate reefs in turbid nearshore regions, as well as in clear water reefs away from coastal influences (Dinesen 1983, Fabricius 1997). Abundances of soft corals are strongly related to their physical environment (Fabricius and De'ath 1997), but their role in reef communities is not well understood.

We analyze three groups of taxa: (1) Efflatounaria (family Xeniidae) comprises three species (Gohar 1939, Versefeldt 1977) which are not reliably distinguished; (2) Sinularia spp. (family Alcyoniidae; Versefeldt 1980) comprises five ill-defined species with very similar morphology and distribution, and includes S. capitalis and S. polydactyla; and (3) the distinct species Sinularia flexibilis (Versefeldt 1980). Efflatounaria is locally dominant in clear offshore waters, whereas Sinularia spp. and S. flexibilis are highly abundant and conspicuous nearshore taxa. They can form dense aggregations, to the extent of monopolizing space and excluding the reef-building hard corals on the scale of thousands of square meters (Fabricius 1998).
Additionally, we use Asterospicularia laurae (family Asterospiculariidae; Utinomi 1951) as an example of an uncommon species. It is one of the few soft corals that are reliably identified to species level in the field.

Zonations along the gradients of depth and distance to land have been extensively used to explain patterns in abundances of individual taxa on the Great Barrier Reef (Done 1982, Dinesen 1983). However, spatial variables are only proxies for a range of physical environmental variables, with which they are highly correlated. The relationships between spatial and physical variables are often complex. Hence the question of which physical variables determine the distribution of a taxon often remains unresolved, a question we attempt to address in this study.

Data comprising abundances of 38 genera of soft coral, four physical and five spatial variables (Table 1), were collected during surveys of 374 sites at 92 locations on 32 reefs within the Australian central Great Barrier Reef (Table 1 and Fig. 1). Each site was visually surveyed by one experienced observer (K. Fabricius), by scuba diving over typically 300–500 m (≈900–2000 m²), for 15–20 min, within each of five defined depth ranges. The distribution of sites was highly unbalanced with respect to their defining characteristics (Table 2).

EXAMPLES OF CLASSIFICATION AND REGRESSION TREES

A regression tree example

As an illustrative example, we use the ratings of abundances (row 1 of Table 1) of Asterospicularia laurae as the numeric response variable (Fig. 2a). The species is uncommon, occurring on 15% of sites (mean rating = 0.241, n = 373) with <1% of sites having a rating >2. The explanatory variables used in the model are cross-shelf position, location, and depth, all of which appear in the tree. Splits minimize the sums of squares (SS) within groups.
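
The SS criterion can be illustrated with a minimal sketch for a single numeric explanatory variable. It is not the software used for the analyses in this paper: the function names are hypothetical, thresholds are placed at midpoints between unique values as an assumption, and the depths and ratings are made-up numbers rather than survey data.

```python
def sum_of_squares(y):
    """Sum of squared deviations from the group mean."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def best_numeric_split(x, y):
    """Return (threshold, within-group SS, proportion of total SS explained).

    Returns None when x has fewer than two unique values (no split possible).
    """
    total = sum_of_squares(y)
    unique = sorted(set(x))
    best = None
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2  # midpoint is an illustrative choice
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        within = sum_of_squares(left) + sum_of_squares(right)
        if best is None or within < best[1]:
            best = (threshold, within)
    if best is None:
        return None
    threshold, within = best
    explained = (total - within) / total if total > 0 else 0.0
    return threshold, within, explained


# Made-up depths (m) and abundance ratings; NOT the survey data.
depth = [0.5, 0.5, 2.0, 2.0, 5.5, 5.5, 10.5, 10.5]
rating = [0, 0, 0, 1, 2, 1, 2, 3]
print(best_numeric_split(depth, rating))
```

The third returned value corresponds to the proportion of the total sum of squares explained by a split. A full tree would repeat this search over every explanatory variable and then within each resulting group.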

TABLE 1. Description of the variables used in the soft coral study. The character of variables is denoted by B = biotic, P = physical, or S = spatial, and the type by N = numeric or C = categorical.

Variable                             Character  Type  Values
Abundances of soft corals (38 taxa)  B          N     0 (absent), 1 (few), 2 (uncommon), 3 (common), 4 (abundant), 5 (dominant)
Sediment                             P          C     0 (none), 1 (thin), 2 (moderate), 3 (thick)
Visibility                           P          N     1–33 m
Wave action                          P          C     0 (none), 1 (moderate), 2 (strong)
Slope angle                          P          N     0–90° in 5° increments
Cross-shelf position                 S          C     Inner, mid-, or outer shelf
Reef type                            S          C     Fringing around islands or platform
Within-reef location                 S          C     Front, back, channel, flank
Depth zone                           S          C     0–1, 1–3, 3–8, 8–13, 13–18 m
Reef identity                        S          C     32 levels

Note: Four of the 38 soft coral taxa were analyzed in terms of the physical and spatial variables.

FIG. 1. Schematic representation of cross-shelf position (inner, mid, and outer), reef type (platform or fringing), and types of site location on the reef (front, back, flank, and channel). Fringing reefs form around islands, whereas platform reefs rise from the sea floor. In the survey area fringing reefs occurred only on the inner shelf. Fronts of reefs face the prevailing wind, and backs are on the leeward side, the two being joined by flanks. Channel sites occur on fringing reefs between closely located islands and typically have high currents.

The first split is based on shelf position, with inner- and mid-shelf reefs in the left branch, and outer-shelf reefs in the right branch. The left node is strongly homogeneous, and is not subsequently divided, forming a leaf with mean rating of 0.038. For regression trees, the proportion of the total sum of squares explained by each split is important information, and could be displayed on the tree. However, we can also represent this graphically by the relative lengths of the vertical lines associated with each split (Fig. 2a), a practice we use for all trees in this paper.
The right branch, comprising all outer-shelf reefs, is now divided into back and flank reef locations to the left and front locations to the right; there are no channel sites on outer-shelf reefs. The splitting process is repeated, separating front-reef sites of depths above and below 3 m, which completes the tree with four leaves; 49.2% of the total sum of squares is explained. The bar charts at each leaf show the distribution of observed ratings (0–3). They show A. laurae to be relatively common on the fronts of outer-shelf reefs, particularly at depths ≥3 m, but virtually absent from inner- and mid-shelf reefs.

A classification tree example

For the example in Fig. 2b, the response variable is the presence–absence of A. laurae. In this case, the tree is identical in structure to the regression tree, but the splits have relatively different strengths, as represented by their vertical lengths. Splits are based on the proportions of presences and absences in the groups. The leaves of the tree are characterized by their dominant category (present or absent), and the proportion of sites of that category; e.g., for the leftmost leaf, A. laurae is absent on 97% of inner- and mid-shelf reefs (n = 263). When the response has more than two categories, leaves are characterized by their dominant category and the proportions in each category. A classification tree, treating the ratings of A. laurae as four distinct categories (not shown), gave identical leaves to the presence–absence tree, but had a stronger split for depth.
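
A split criterion based on category proportions can be sketched as follows. This section does not state which homogeneity measure was used for Fig. 2b, so the Gini index is assumed here as one common choice. The function names are hypothetical and the presence–absence records are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared category proportions (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted mean impurity of the two groups formed by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def describe_leaf(labels):
    """Dominant category, its proportion, and the group size."""
    category, count = Counter(labels).most_common(1)[0]
    return category, count / len(labels), len(labels)


# Made-up presence/absence records; NOT the survey data.
inner_mid = ["absent"] * 9 + ["present"] * 1
outer = ["absent"] * 4 + ["present"] * 6

# Impurity before the split, and after splitting by shelf position.
print(gini(inner_mid + outer))
print(split_impurity(inner_mid, outer))

# Each leaf is summarized by its dominant category and proportion.
print(describe_leaf(inner_mid))
print(describe_leaf(outer))
```

A candidate split is preferred when it lowers the weighted impurity relative to the undivided group. The same functions apply unchanged to a response with more than two categories, because the impurity sums over however many categories are present.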