Statistics from the inside. 16. Multiple regression (2).

M J Healy

Journal article, open access

Archives of Disease in Childhood (1995) 73(3) 270-274

DOI: 10.1136/adc.73.3.270

Abstract

Regression equations with two or more x's

In a previous article in this series I described some aspects of the simple regression model, where the mean of a variate y is related to a quantity x in a linear (straight line) fashion. With some formality we can write

$E(y \mid x) = \alpha + \beta x$

In this equation $\alpha$ is the value of y where the line crosses the y axis, and is often called the intercept; $\beta$ is the slope of the line, the amount of increase in y per unit increase in x. $E(y \mid x)$ is mathematicians' notation for a mean value (misleadingly called an expectation) and the vertical bar shows that the mean is that of y for a particular value of x, a conditional mean (compare the conditional probabilities that I wrote about in an earlier note). The usual name for y is the dependent variate; x goes by various names, notably the predictor or covariate or (misleadingly, as we shall see) the independent variate. Obvious examples are where y might be the response to a drug and x the dose; or y the head circumference of a baby and x the baby's weight. Notice the important assumption of linearity; this means that a given change in x corresponds to a fixed change in y, no matter where it starts from.

One use of the regression equation is to predict the value of y that might correspond to an observed value of x on a future occasion. The prediction will not of course be perfect, and the observed value of y will differ from that which is predicted by the equation. The difference is usually called a residual. The sizes of the residuals can be summarised by quoting their standard deviation (their mean is exactly zero), and this is called the residual standard deviation or residual standard error. Roughly speaking, around 95% of the residuals can be expected to fall short, in absolute size, of twice the residual standard deviation. It is a natural extension of this idea to use two or more covariates simultaneously to predict the value of y. This leads to a multiple regression equation.
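The simple regression fit and the residual standard deviation described above can be sketched in a few lines of Python. This is only an illustration: the dose and response values are invented for the purpose, the function name `simple_regression` is hypothetical, and the divisor n - 2 for the residual standard deviation is the conventional choice, assumed here rather than taken from the article.

```python
import math

# Hypothetical dose (x) and response (y) values, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def simple_regression(x, y):
    """Return least-squares estimates (intercept, slope) for E(y | x) = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x about its mean, and sum of cross products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = simple_regression(x, y)

# A residual is the observed y minus the value predicted by the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Residual standard deviation, using n - 2 degrees of freedom (assumed convention).
residual_sd = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))

print("intercept:", round(intercept, 4))
print("slope:", round(slope, 4))
print("sum of residuals:", round(sum(residuals), 10))
print("residual SD:", round(residual_sd, 4))
```

The residuals of a least-squares line with an intercept always sum to zero, which is why only their standard deviation needs quoting.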
Starting simply, consider the miniature example in table 1 which shows measurements of height, weight, and chest circumference of 10 army cadets. The mean chest circumference is 102.6 cm with a standard deviation of 6.78 cm and this suggests that most future measurements of chest circumference from the same population might fall in the range mean ±2 SD, 89 to 116 cm, a width of 27 cm. The standard deviation thus measures our degree of uncertainty concerning the chest circumference of a random individual from this population. We could try to reduce this by predicting chest circumference from height or from weight by doing simple regressions. The standard calculations show that the regression coefficient on height is -0.43 (SE 0.48) giving a residual standard deviation of 6.86 cm; that on weight is +0.92 (SE 0.17) with residual standard deviation of 3.32 cm. Comparing the coefficients with their standard errors, it appears that height is useless as a predictor in this small sample. Weight on the other hand may be quite successful, with a highly significant regression coefficient. These conclusions are confirmed by the reduction (or lack of it) in the residual standard deviation: compared with the previous value of 27 cm, the ±2 SD interval measures 27.4 cm when the subject's height is allowed for, 13.3 cm when weight is allowed for.

What about using them both simultaneously? We need to estimate a relationship of the form

$E(y \mid x_1, x_2) = \alpha + \beta_1 x_1 + \beta_2 x_2$

where y stands for chest circumference and $x_1$, $x_2$ for height and weight respectively. With only two covariates this is not an impossible task for a pocket calculator but multiple regression calculations can become quite heavy and a good computer package is desirable. Most packages provide the same items of information in different guises; I have chosen the Nanostat package (Alphabridge Ltd, 26 Downing Court, London WC1N 1LX) with which I am personally familiar. The computer output is shown in table A.
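The interval arithmetic in this passage is easy to check. The sketch below takes the quoted figures as mean 102.6 cm, SD 6.78 cm, residual SDs 6.86 cm and 3.32 cm, and coefficients -0.43 (SE 0.48) and +0.92 (SE 0.17). The helper `two_sd_range` and the variable names are hypothetical, and the ratio of a coefficient to its standard error is used only as the informal comparison the text describes.

```python
# Figures as quoted in the text (cm for the chest measurements).
mean_chest = 102.6
sd_chest = 6.78
residual_sd_height = 6.86   # after simple regression on height
residual_sd_weight = 3.32   # after simple regression on weight


def two_sd_range(centre, sd):
    """Return the (lower, upper) limits of centre +/- 2 SD."""
    return centre - 2 * sd, centre + 2 * sd


low, high = two_sd_range(mean_chest, sd_chest)
width_raw = high - low                      # no covariate used
width_height = 4 * residual_sd_height       # height allowed for
width_weight = 4 * residual_sd_weight       # weight allowed for

# Coefficient divided by its standard error, for each simple regression.
ratio_height = -0.43 / 0.48
ratio_weight = 0.92 / 0.17

print("range:", round(low, 2), "to", round(high, 2), "width", round(width_raw, 2))
print("width allowing for height:", round(width_height, 2))
print("width allowing for weight:", round(width_weight, 2))
print("coefficient / SE, height:", round(ratio_height, 2))
print("coefficient / SE, weight:", round(ratio_weight, 2))
```

The ratio for height is below 1 in absolute size while that for weight exceeds 5, which matches the text's verdict on the two predictors.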
This is rather a formidable amount of information for a fairly simple problem and it is important not to be intimidated by it. It is most easily read from the bottom up. You will see that the estimated equation can be written as

Chest circumference = 132.66 - 0.54746 × height + 0.95709 × weight

Table 1 Measurements on army cadets

Height (cm)   Weight (kg)   Chest circumference (cm)
167.9         71.8          107.3
183.8         75.1          105.2
172.9         58.0           93.4
175.5         58.4           91.9
176.4         67.7           99.8
168.5         75.2          113.4
178.0         71.3          103.7
178.0         67.3           98.1
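For readers who want to see how such an equation is estimated, here is a minimal least-squares sketch with two covariates. It solves the normal equations after centring the variables. It uses only the eight rows of table 1 that appear in this excerpt, so its coefficients should not be expected to reproduce the published ones, which rest on all 10 cadets. The function names `fit_two_covariates` and `published_prediction` are hypothetical, and this is not the Nanostat algorithm.

```python
# Eight (height cm, weight kg, chest cm) rows as they appear in this excerpt of table 1.
rows = [
    (167.9, 71.8, 107.3),
    (183.8, 75.1, 105.2),
    (172.9, 58.0, 93.4),
    (175.5, 58.4, 91.9),
    (176.4, 67.7, 99.8),
    (168.5, 75.2, 113.4),
    (178.0, 71.3, 103.7),
    (178.0, 67.3, 98.1),
]


def mean(values):
    return sum(values) / len(values)


def fit_two_covariates(rows):
    """Least-squares fit of y = a + b1*x1 + b2*x2; returns (a, b1, b2)."""
    x1 = [r[0] for r in rows]
    x2 = [r[1] for r in rows]
    y = [r[2] for r in rows]
    m1, m2, my = mean(x1), mean(x2), mean(y)
    # Deviations from the means; centring keeps the arithmetic well conditioned.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    # Solve the 2 x 2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


def published_prediction(height, weight):
    """Prediction from the equation quoted in the text (fitted to all 10 cadets)."""
    return 132.66 - 0.54746 * height + 0.95709 * weight


a, b1, b2 = fit_two_covariates(rows)
residuals = [y - (a + b1 * h + b2 * w) for h, w, y in rows]

print("fit on the eight listed rows:", round(a, 2), round(b1, 4), round(b2, 4))
print("published equation, first cadet:", round(published_prediction(167.9, 71.8), 2))
```

Applied to the first cadet (height 167.9 cm, weight 71.8 kg), the published equation predicts about 109.5 cm against an observed 107.3 cm, a residual of roughly -2.2 cm.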


Healy, M. J. (1995). Statistics from the inside. 16. Multiple regression (2). Archives of Disease in Childhood, 73(3), 270–274. https://doi.org/10.1136/adc.73.3.270
