Overview of quantitative measurement methods. Equivalence, invariance, and differential item functioning in health applications
- ISSN: 00257079
- DOI: 10.1097/01.mlr.0000245452.48613.45
- PubMed: 17060834
Abstract
BACKGROUND: Reviewed in this article are issues relating to the study of invariance and differential item functioning (DIF). The aim of factor analyses and DIF, in the context of invariance testing, is the examination of group differences in item response conditional on an estimate of disability. Discussed are parameters and statistics that are not invariant and cannot be compared validly in crosscultural studies with varying distributions of disability in contrast to those that can be compared (if the model assumptions are met) because they are produced by models such as linear and nonlinear regression. OBJECTIVES: The purpose of this overview is to provide an integrated approach to the quantitative methods used in this special issue to examine measurement equivalence. The methods include classical test theory (CTT), factor analytic, and parametric and nonparametric approaches to DIF detection. Also included in the quantitative section is a discussion of item banking and computerized adaptive testing (CAT). METHODS: Factorial invariance and the articles discussing this topic are introduced. A brief overview of the DIF methods presented in the quantitative section of the special issue is provided together with a discussion of ways in which DIF analyses and examination of invariance using factor models may be complementary. CONCLUSIONS: Although factor analytic and DIF detection methods share features, they provide unique information and can be viewed as complementary in informing about measurement equivalence.
Overview of Quantitative Measurement Methods
Equivalence, Invariance, and Differential Item Functioning
in Health Applications
Jeanne A. Teresi, EdD, PhD
Key Words: measurement equivalence, invariance, differential
item functioning, item response theory, health assessment methods
(Med Care 2006;44: S39–S49)
The quantitative section of this volume contains articles describing
methods for examination of measurement equivalence. A purpose of this
article is to introduce each method and
to provide an integrated overview of the methods. The methods
include classical test theory (CTT), factor analytic, and parametric
and nonparametric approaches to examination of differential
item functioning (DIF). Also included in the quantitative section
is a discussion of item banking and computerized adaptive
testing (CAT). Commentaries accompany the articles. The
reader may obtain additional details about the application of
each method by referring to the articles contained within this
special issue and cited subsequently.
Examination of item invariance and item and measure-level
DIF is important in the context of health disparities research
and individual assessment. Applications of CAT are
appearing to an increasing degree in the noneducational measurement
literature. Such a method offers an intuitively appealing
approach to tailoring tests for individuals, minimizing the
assessment burden on these respondents while providing a
method through which different items, perhaps more uniquely
suited to members of specific cultural groups, can be selected
automatically for administration. Using this approach, bias in
tests (theoretically) might be minimized because if an item is
appropriate for one group but has a different meaning for
members of a different group (items are not conceptually or
semantically equivalent), a substitute item could be selected
for the other group. Thus, items with known properties with
respect to different groups could be banked for use in CAT.
Individualized measures could be used and fewer items selected.
However, as discussed by Krause in this issue,1 combining items
with common meaning with those unique to specific groups may
result in conceptual drift rendering interpretation of group differences
difficult at best. As described by Hahn et al,2 and in the
comment by Reeve,3 item banks should contain items that have
been studied carefully in terms of conceptual equivalence, invariance,
and DIF, because item banks with items that exhibit a
high magnitude of consequential noninvariance will be compromised,
as will the CAT applications using such banks. Whether
or not CAT attains wider use in health assessment, measures
should be examined for cultural equivalence. Described subsequently,
and in the quantitative section of this volume, are issues
relating to such study of invariance and DIF.
Conceptual Definitions of Invariance
Common to all methods for establishing measurement
equivalence is the concept of invariance. Invariance is defined
in various ways in both the statistical and the nonstatistical
literature. The term has been used to describe both the
theoretical properties of parameters and statistics
and the empiric tests of the performance of items and scales
across subgroups. Related to these concepts is the relationship
between the theoretical formulations and their realization
in practice. Some parameters and statistics do not possess the
theoretical properties that permit comparison across subgroups
differing in terms of level of disability or prevalence
of health disorder. This has resulted in erroneous comparisons
and conclusions about the properties of a measure when
used in a particular subgroup.
From the Columbia University Stroud Center and Faculty of Medicine, New
York State Psychiatric Institute and the Research Division, Hebrew
Home for the Aged at Riverdale, Riverdale, New York.
Supported in part by the Columbia University Resource Center for Minority
Aging Research (RCMAR), National Institute on Aging, National Institute
of Nursing Research, and the National Center for Minority Health
and Health Disparities (AG15294).
Reprints: Jeanne A. Teresi, EdD, PhD, Research Division, HHAR, 5901
Palisade Avenue, Riverdale, NY 10471. E-mail: teresimeas@aol.com;
jat61@columbia.edu.
Copyright © 2006 by Lippincott Williams & Wilkins
ISSN: 0025-7079/06/4400-0039
Medical Care Volume 44, Number 11 Suppl 3, November 2006 S39
Invariance of measurement models can be discussed in
terms of the preservation of the mapping of objects to
numbers after rescaling or transformations; for example, if
the ordering permits the addition of numbers, then rescaling
length (from inches to centimeters) should not alter the
relationship between the numbers and objects.4 The concept
of invariance is fundamental to measurement because it
relates to the permissible statistical operations that can be
performed with respect to comparison of factor loadings or
parameters derived from different models. Constraints are
used in factor analytic and other models to define a common
scale (often called resolving indeterminacy); different scalings
would lead to different conclusions if invariance were
not a property of the procedure. A technical treatment of
equivariance and invariance with respect to factor loadings
and standard errors can be found in Yuan and Bentler,5 who
show that if standard errors do not meet the statistical definition
of invariance (with respect to commonly used transformations
in factor analyses), the significance tests examining
the factor loadings are incorrect. The authors point out
that lack of invariance of the standard error of factor loadings
leads to a state in which their significance would depend on
the scale used in measurements and conclusions about significance
would be arbitrary.
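The practical meaning of this requirement can be illustrated with a
simpler, analogous case than the factor analytic one treated by Yuan
and Bentler: ordinary least squares regression. The following minimal
sketch is an illustration only. The data values and the function name
slope_and_se are hypothetical and do not come from this article. A
predictor is rescaled from inches to centimeters; the slope and its
standard error change by the same factor, so their ratio, and with it
the conclusion about significance, does not depend on the unit of
measurement.

```python
import math

def slope_and_se(x, y):
    """Return the OLS slope and its standard error for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

# Hypothetical data: a length measured in inches and an outcome score.
length_in = [60.0, 62.0, 65.0, 67.0, 70.0, 72.0]
score = [3.1, 3.9, 4.2, 5.4, 5.8, 6.9]
length_cm = [2.54 * v for v in length_in]  # same objects, rescaled

slope_in, se_in = slope_and_se(length_in, score)
slope_cm, se_cm = slope_and_se(length_cm, score)

# The slope and its standard error both shrink by the factor 2.54,
# so the test statistic (slope / se) is the same on either scale.
t_in = slope_in / se_in
t_cm = slope_cm / se_cm
print(f"inches: slope={slope_in:.4f}, se={se_in:.4f}, t={t_in:.3f}")
print(f"cm:     slope={slope_cm:.4f}, se={se_cm:.4f}, t={t_cm:.3f}")
```

If the standard error did not rescale along with the slope, the two
test statistics would differ and the verdict on significance would
depend on the arbitrary choice of unit.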
Parameter invariance can be defined as equivalent population
parameters (eg, factor loadings or item difficulties and
discriminations) across different populations.6 However, as presented
previously, for parameters to be compared across groups
to determine whether they are equivalent, they must be on the
same scale. Because estimates, rather than the actual population
parameters, are observed, such invariance is not a guarantee, but
an idealization7 that applies theoretically if the model fits.8
As explicated previously, invariance of parameters and
statistics has two statistical referents: (1) the theoretical distributional
properties and the retention of specified properties of
parameters after rescaling and transformation, presented as
invariance and equivariance, eg,5 and (2) the dependence of
the estimates on marginal probabilities or distributions that
can be influenced by differences in the base rate or prevalence.
Although these factors are interconnected, they can be
viewed as separate ways of presenting theoretical arguments
regarding statistical invariance.
Comparison of Item Parameters and Summary Statistics
Variant Parameters and Statistics
Consideration of parameters and statistics that are not
invariant and cannot be compared validly in crosscultural
studies with varying distributions of disability has received
wide attention, eg,9–12 and is discussed by Dorans and Kulick
in this issue.13 Reliant as they are on average interitem
correlations and thus on the degree of heterogeneity in a
particular sample, classical test theory-derived reliability coefficients
such as coefficient alpha14 and various forms of the
intraclass correlation coefficient15,16 will vary across populations
varying in item prevalences, rendering problematic
comparisons across samples drawn from these populations.
The same situation obtains for interrater reliability coefficients
such as kappa.17 Marginal totals can affect both the
maximum value of Po (observed total agreement) and Pe (the
correction factor for chance agreement calculated using marginal
probabilities).15,16
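The dependence of kappa on the marginal totals can be made concrete
with a small numerical sketch. The counts below are hypothetical and
chosen only for illustration, and the function name cohen_kappa is an
assumption of this example. Two raters agree on 90% of cases in both
tables, but the rated condition is common in the first table and rare
in the second.

```python
def cohen_kappa(both_yes, yes_no, no_yes, both_no):
    """Return (Po, Pe, kappa) for a 2 x 2 table of two raters' judgments."""
    n = both_yes + yes_no + no_yes + both_no
    p_o = (both_yes + both_no) / n          # observed total agreement
    rater1_yes = (both_yes + yes_no) / n    # marginal proportion, rater 1
    rater2_yes = (both_yes + no_yes) / n    # marginal proportion, rater 2
    # Chance agreement is built entirely from the marginal proportions.
    p_e = rater1_yes * rater2_yes + (1 - rater1_yes) * (1 - rater2_yes)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Hypothetical counts; both tables show 90% observed agreement.
po_common, pe_common, kappa_common = cohen_kappa(45, 5, 5, 45)  # prevalence 50%
po_rare, pe_rare, kappa_rare = cohen_kappa(5, 5, 5, 85)         # prevalence 10%

print(f"common condition: Po={po_common:.2f}, Pe={pe_common:.2f}, kappa={kappa_common:.2f}")
print(f"rare condition:   Po={po_rare:.2f}, Pe={pe_rare:.2f}, kappa={kappa_rare:.2f}")
```

With identical observed agreement, kappa is 0.80 when the condition
is common and about 0.44 when it is rare, because Pe rises from 0.50
to 0.82 as the marginals become more extreme.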
The issue of comparing reliabilities becomes especially
problematic in the health sciences where items with low
prevalences, indicative of greater severity, eg, thoughts of
suicide, blurred vision, high systolic blood pressure, can be
the norm. Similarly, the item prevalences for hypertension-
related items may be higher in some samples of blacks than
of Latinos, resulting in lower item variances, covariances,
corrected item-total correlations, and alpha coefficients for
the Latino group. Because there is no theoretical expectation
of invariance for CTT parameters, the comparison of such
coefficients as evidence of differential item functioning is
problematic, as is shown by Hambleton and colleagues10 and
by Dorans and Kulick in this issue.13
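A small simulation can show why a difference in alpha is weak
evidence of DIF. The sketch below is a minimal illustration under
stated assumptions, none of which come from this article: five binary
items follow a two-parameter logistic model with identical
(hypothetical) parameters in both groups, so there is no DIF by
construction, and the groups differ only in mean disability, which
makes the items common in one group and rare in the other.

```python
import math
import random

def simulate_items(mean_theta, n, rng, slope=1.7, location=0.0, k=5):
    """Simulate k binary items; item parameters are the same for every group."""
    data = []
    for _ in range(n):
        theta = rng.gauss(mean_theta, 1.0)  # latent disability
        p = 1.0 / (1.0 + math.exp(-slope * (theta - location)))
        data.append([1 if rng.random() < p else 0 for _ in range(k)])
    return data

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(data):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(data[0])
    item_vars = [variance([row[j] for row in data]) for j in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def prevalence(data):
    """Proportion of all item responses that are endorsed."""
    return sum(sum(row) for row in data) / (len(data) * len(data[0]))

rng = random.Random(2006)  # fixed seed so the run is reproducible
group_common = simulate_items(0.0, 10000, rng)   # items endorsed often
group_rare = simulate_items(-2.5, 10000, rng)    # items endorsed rarely

prev_common, prev_rare = prevalence(group_common), prevalence(group_rare)
alpha_common, alpha_rare = cronbach_alpha(group_common), cronbach_alpha(group_rare)
print(f"common: prevalence={prev_common:.2f}, alpha={alpha_common:.2f}")
print(f"rare:   prevalence={prev_rare:.2f}, alpha={alpha_rare:.2f}")
```

Alpha is lower in the low-prevalence group even though every item
relates to disability in exactly the same way in both groups.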
Invariant Parameters and Statistics
Discussed previously were summary measures that
were not invariant, which include various forms of the correlation
coefficient. For example, the phi coefficient (describing
the correlation between 2 binary variables) is not invariant
because it relies on the marginal values (the totals from
the rows and columns) rather than on the entries in the cells
and is, therefore, affected by the base rate. In contrast, the
odds ratio (used in several DIF detection methods described
subsequently) possesses, under certain circumstances,18 invariance
properties and thus can be compared.
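The contrast between phi and the odds ratio can be sketched
numerically. The two 2 x 2 tables below are hypothetical. The second
is obtained from the first by dividing one column by 10, which changes
the base rate of one variable but leaves the odds ratio untouched.
This is one circumstance in which the odds ratio is invariant; it is
offered as an illustration rather than as a general statement of the
conditions in the cited work.

```python
import math

def odds_ratio(a, b, c, d):
    """Cross-product ratio of a 2 x 2 table with cells a, b / c, d."""
    return (a * d) / (b * c)

def phi(a, b, c, d):
    """Phi coefficient; the denominator is the product of the four marginals."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Cells: a = yes/yes, b = yes/no, c = no/yes, d = no/no (hypothetical counts).
table_balanced = (40, 10, 10, 40)
table_skewed = (4, 10, 1, 40)   # first column of the balanced table divided by 10

or_balanced, phi_balanced = odds_ratio(*table_balanced), phi(*table_balanced)
or_skewed, phi_skewed = odds_ratio(*table_skewed), phi(*table_skewed)
print(f"balanced: odds ratio={or_balanced:.1f}, phi={phi_balanced:.2f}")
print(f"skewed:   odds ratio={or_skewed:.1f}, phi={phi_skewed:.2f}")
```

The odds ratio is 16 in both tables, whereas phi falls from 0.60 to
about 0.40 solely because the marginal distribution changed.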
Parameters produced by linear regression as well as by
nonlinear regression models such as item response theory
(IRT), eg, the slope and intercept, are invariant if the model
assumptions are met. Unlike the correlation coefficient, the
slope is not affected by the variability of the population being
sampled; however, an important point is that the proper
estimation of the regression line requires a heterogeneous
sample.10 Because of the properties of IRT parameters (their
theoretical invariance that renders them distribution free),
empiric investigation of the performance of items across subgroups
(DIF) is permitted. Similarly, the factor loadings (λ's)
from confirmatory factor analyses (CFA) are statistically, theoretically
invariant because they represent regression coefficients
in the relationship between observed indicators and the latent
attribute. Therefore, they can be compared.
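The different behavior of the slope and the correlation coefficient
can be sketched with simulated data. The example below is a minimal
illustration under assumptions that are not part of this article: a
linear model with a true slope of 0.5, normally distributed predictor
and error, and a "homogeneous" subsample defined by keeping only cases
with a predictor value within 0.5 of the mean.

```python
import math
import random

def slope_and_r(x, y):
    """Return the OLS slope of y on x and the Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

rng = random.Random(2006)  # fixed seed so the run is reproducible
TRUE_SLOPE = 0.5           # hypothetical population value

# Heterogeneous sample: the predictor covers its full range.
x_full = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y_full = [TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in x_full]

# Homogeneous subsample: same regression line, restricted predictor range.
kept = [(x, y) for x, y in zip(x_full, y_full) if abs(x) < 0.5]
x_homog = [pair[0] for pair in kept]
y_homog = [pair[1] for pair in kept]

slope_full, r_full = slope_and_r(x_full, y_full)
slope_homog, r_homog = slope_and_r(x_homog, y_homog)
print(f"heterogeneous: n={len(x_full)}, slope={slope_full:.2f}, r={r_full:.2f}")
print(f"homogeneous:   n={len(x_homog)}, slope={slope_homog:.2f}, r={r_homog:.2f}")
```

Both samples estimate the same slope, whereas the correlation is much
smaller in the homogeneous subsample. The slope estimate from the
homogeneous subsample is also less precise, which is why proper
estimation of the regression line calls for a heterogeneous sample.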
In the context of IRT,8 the item characteristic curve is
the regression of item score on θ (the estimate of latent
disability) and, because of its definition, the item characteristic
function necessarily remains invariant across groups so


