Statistical Primer for Cardiovasc...
Hypothesis Testing: Means

Roger B. Davis, ScD; Kenneth J. Mukamal, MD, MPH

In most biomedical research, investigators hypothesize the relationships of various factors, collect data to test those relationships, and try to draw conclusions about those relationships from the data collected. In many cases, investigators test relationships by comparing the average level of a factor between 2 groups or between 1 group and a standard reference. This framework is as true for understanding the basic role of cardiac myosin binding protein-C phosphorylation in cardiac physiology1 as it is for evaluating non–high-density lipoprotein cholesterol (HDL-C) as a predictor of myocardial infarction in large groups of individuals.2 In this article we describe hypothesis testing, which is the process of drawing conclusions on the basis of statistical testing of collected data, and the specific approach used to test means (or average levels of a collected data element). These concepts are covered in detail in many statistical textbooks at various levels, including Pagano and Gauvreau,3 Zar,4 and Kleinbaum et al.5

Hypothesis Testing

The purpose of statistical inference is to draw conclusions about a population on the basis of data obtained from a sample of that population. Hypothesis testing is the process used to evaluate the strength of evidence from the sample, and it provides a framework for making determinations related to the population; ie, it provides a method for understanding how reliably one can extrapolate observed findings in a sample under study to the larger population from which the sample was drawn. The investigator formulates a specific hypothesis, evaluates data from the sample, and uses these data to decide whether they support the specific hypothesis.
The first step in testing hypotheses is the transformation of the research question into a null hypothesis, H0, and an alternative hypothesis, HA.6 The null and alternative hypotheses are concise statements, usually in mathematical form, of 2 possible versions of "truth" about the relationship between the predictor of interest and the outcome in the population. These 2 possible versions of truth must be exhaustive (ie, cover all possible truths) and mutually exclusive (ie, not overlapping). The null hypothesis is conventionally used to describe a lack of association between the predictor and the outcome; the alternative hypothesis describes the existence of an association and is typically what the investigator would like to show. The goal of statistical testing is to decide whether there is sufficient evidence from the sample under study to conclude that the alternative hypothesis should be believed.

Hypothesis testing has been likened to a criminal trial, in which a jury must use evidence to decide which of 2 possible truths, innocence (H0) or guilt (HA), is to be believed. Just as a jury is instructed to assume that the defendant is innocent unless proven otherwise, the investigator should assume there is no association unless there is strong evidence to the contrary. A jury's verdict must be either guilty or not guilty, and a not-guilty verdict does not equal innocence; rather, it indicates that the burden of proof has not been met. Similarly, an investigator can only reject H0 or fail to reject it; failure to reject does not prove that H0 is true. In a criminal trial in the United States, the required burden of proof is "beyond a reasonable doubt." For hypothesis testing, the investigator sets the burden by selecting the level of significance for the test, which is the probability of rejecting H0 when H0 is true.
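The meaning of the level of significance can be illustrated with a small simulation. The sketch below is not part of the original analysis. It assumes a significance level of 0.05, normally distributed data with a known SD (so that a simple z test stands in for the t test), and hypothetical function names and parameter values. It repeatedly draws samples from a population in which H0 is true and counts how often the test nevertheless rejects H0.

```python
import math
import random
from statistics import NormalDist, mean


def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-sided z test of H0: population mean == mu0 (sigma assumed known)."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return abs(z) > critical


def false_positive_rate(n_trials=10000, n=30, mu0=0.0, sigma=1.0,
                        alpha=0.05, seed=1):
    """Fraction of simulated studies that reject H0 although H0 is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(n_trials):
        # Every sample is drawn from a population whose mean really is mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        if z_test_rejects(sample, mu0, sigma, alpha):
            rejections += 1
    return rejections / n_trials


rate = false_positive_rate()
print(f"Proportion of true null hypotheses rejected: {rate:.3f}")
```

Under these assumptions, the printed proportion should fall close to the chosen significance level, which is what "the probability of rejecting H0 when H0 is true" means in practice.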
The standard value chosen for the level of significance is 5% (ie, P<0.05), which is a much weaker standard than that used in the criminal justice system. This standard means that even if no association between predictor and outcome exists in the population, the investigator is willing to accept a 1 in 20 chance of a false-positive conclusion that an association does exist. Just as hypothesis testing can reject a true null hypothesis (referred to as a type I error), it can fail to reject H0 when the predictor and outcome are associated (type II error). The probability of such a false-negative conclusion is called β. The quantity (1−β) is called the power of the test and is simply the probability of drawing the correct conclusion (ie, rejecting H0) when an association between predictor and outcome actually does exist.

In most cases, investigators are equally interested in whether a predictor leads to higher or lower levels of the outcome. In this situation, we specify a 2-sided statistical test, in which we accept a combined rate of false-positives (for both the higher and lower level of the outcome) of only 5%. If only 1 direction is of interest, a 1-sided test may be appropriate, but this choice requires strong justification. Because a 1-sided test is less stringent, many readers (and journal editors) appropriately view 1-sided tests with skepticism.7 Two-sided tests should also be considered the default option because an investigator's intuition about how a study will come out may be incorrect. If an investigator chooses a 1-sided test but observes results opposite to those expected, the strongest statement that can be made is that the null hypothesis was not rejected. For these reasons, the investigator should always specify the hypotheses, the methods of analysis, and the level of significance before initiating the research.

From the Division of General Medicine and Primary Care, Beth Israel Deaconess Medical Center, Boston, Mass.
Correspondence to Roger B. Davis, ScD, Division of General Medicine and Primary Care, Beth Israel Deaconess Medical Center, 330 Brookline Ave, RO-108, Boston, MA 02215. E-mail rdavis@bidmc.harvard.edu
(Circulation. 2006;114:1078-1082.)
© 2006 American Heart Association, Inc.
Circulation is available at http://www.circulationaha.org DOI: 10.1161/CIRCULATIONAHA.105.586461

Means

In clinical practice and in biomedical research, we collect substantial amounts of numerical data. To analyze such data correctly, it is critical to recognize the different types of numerical data and the various methods specific to each type. Stevens8 proposed 4 classes of measurement scales: nominal scales use numbers strictly as labels for categories with no natural ordering; ordinal scales represent categories with a natural ranking; interval scales use numbers in a truly quantitative sense in which differences between observations are meaningful (eg, temperature); and ratio scales are interval scales that also have a meaningful zero value (eg, height).

The mean of a measure for a population is simply its arithmetic average. It is usually denoted by μ. The mean from the sample that we actually observe, usually designated by x̄, is the sum of the observed measurements for all individuals in the sample, divided by n, the number in the sample. The mean is an appropriate measure for interval and ratio scales but not for nominal or ordinal scales.4

The Figure shows 2 theoretical distributions of data. The first pattern follows a normal distribution. The distribution is symmetrical (ie, the right-hand side is a mirror image of the left-hand side), and the mean and median occur at the same value. Many characteristics we observe approximate this pattern, such as height or HDL-C. The second distribution is skewed and asymmetrical; there are more observations far to the right of the mean than there are far to the left.
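To make the definition of the sample mean and the 2 shapes in the Figure concrete, the following sketch computes the mean and the median for 2 small samples. The data values and variable names are hypothetical and chosen only for illustration; they are not taken from any study.

```python
from statistics import median


def sample_mean(values):
    """Sum of the observed measurements divided by n, the number in the sample."""
    return sum(values) / len(values)


# Hypothetical symmetric sample: values mirror each other around 50.
symmetric = [40, 45, 50, 55, 60]

# Hypothetical right-skewed sample: most values are small, a few are very large.
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 52]

symmetric_mean = sample_mean(symmetric)
symmetric_median = median(symmetric)
skewed_mean = sample_mean(right_skewed)
skewed_median = median(right_skewed)

print(f"Symmetric sample:    mean = {symmetric_mean}, median = {symmetric_median}")
print(f"Right-skewed sample: mean = {skewed_mean}, median = {skewed_median}")
```

In the symmetric sample the mean and the median coincide, whereas in the second sample they do not. The second sample mimics the right-skewed pattern in the lower panel of the Figure.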
The mean of the skewed distribution is larger than its median, because the extreme values to the right increase the mean but do not affect the median. This general pattern is seen in the distributions of C-reactive protein, triglycerides, and coronary artery calcification, as well as medical costs and hospital length of stay. Analysts often perform logarithmic transformation of right-skewed variables like these to improve their fit to a normal distribution.

Although the mean can be distorted by extreme values, there are important reasons why it is the most commonly used measure of "center" in statistical testing. First, when the distribution of a measurement is reasonably symmetrical, statistical tests of the mean tend to have the most power (ie, when differences between groups exist, these tests are most likely to detect them). Second, for some measurements, we may want the center to reflect the pull of extreme values. For example, when measuring health care costs, we may want the "average" expenditure to reflect the almost inevitable presence of a few subjects with very high costs.9 In such a case, the mean multiplied by the sample size recreates the total expenditure in the sample, but the median does not.

One-Sample Tests

In some research projects, the study design includes only a single sample, and the goal may be to determine whether the outcome measure for the population from which the sample was drawn has the same mean as some standard population. Determining an appropriate standard for comparison for these designs is often an issue. Nonetheless, when well-established standards exist, investigators may wish to use these standards for maximal comparability. In this situation, we might perform a 1-sample (not 1-sided) t test. To provide a concrete example, we examine data from a trial of black tea consumption in 28 adults (Table).
As a preliminary step, we might be interested in testing whether the population from which these individuals derive tends to have baseline levels of HDL-C that differ from the overall US population as a test of their generalizability. The distribution of HDL-C in the US adult population is well characterized and has a mean of 50.7 mg/dL.10 Therefore, we would want to determine whether the data from our 28-person sample support a conclusion that the population from which these older adults came has HDL-C levels that differ from 50.7 mg/dL. We would state the null and alternative hypotheses as follows:

H0: μHDL-C = 50.7
HA: μHDL-C ≠ 50.7

To decide which of these hypotheses we believe, we first calculate the mean and standard deviation (SD; a measure of the "spread" or variability of the measurement) of baseline HDL-C in the sample. These are called x̄ and s, respectively.

x̄ (sample mean) = 63.2 mg/dL
s (sample standard deviation) = 13.7 mg/dL
n (sample size) = 28

Figure. Hypothetical frequency distributions of variables with normal (top) and right-skewed (bottom) distributions.
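As a sketch of how these summary values enter a 1-sample t test, the standard test statistic is t = (x̄ − μ0)/(s/√n) with n − 1 degrees of freedom, where μ0 is the reference mean (here 50.7 mg/dL). The code below applies this formula to the sample mean of 63.2, SD of 13.7, and sample size of 28 reported for baseline HDL-C. The function name is our own, and the critical value of 2.052 is the tabulated 2-sided 5% point of the t distribution with 27 degrees of freedom; the sketch also assumes that the usual conditions for a t test are reasonable for these data.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, mu0):
    """Return the 1-sample t statistic and its degrees of freedom."""
    standard_error = sample_sd / math.sqrt(n)  # SD of the sample mean
    t = (sample_mean - mu0) / standard_error
    return t, n - 1


# Summary values for baseline HDL-C in the 28-person sample (mg/dL).
t_stat, df = one_sample_t(sample_mean=63.2, sample_sd=13.7, n=28, mu0=50.7)

# Tabulated 2-sided critical value for alpha = 0.05 and 27 degrees of freedom.
T_CRITICAL_27_DF = 2.052

reject_h0 = abs(t_stat) > T_CRITICAL_27_DF
print(f"t = {t_stat:.2f} on {df} degrees of freedom; reject H0: {reject_h0}")
```

With these inputs the statistic is about 4.83, which exceeds 2.052, so under the stated assumptions this calculation would reject H0 at the 5% level of significance.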