
High stakes testing in higher education and employment: appraising the evidence for validity and fairness.

by Paul R Sackett, Matthew J Borneman, Brian S Connelly
American Psychologist (2008)

Abstract

The authors review criticisms commonly leveled against cognitively loaded tests used for employment and higher education admissions decisions, with a focus on large-scale databases and meta-analytic evidence. They conclude that (a) tests of developed abilities are generally valid for their intended uses in predicting a wide variety of aspects of short-term and long-term academic and job performance, (b) validity is not an artifact of socioeconomic status, (c) coaching is not a major determinant of test performance, (d) tests do not generally exhibit bias by underpredicting the performance of minority group members, and (e) test-taking motivational mechanisms are not major determinants of test performance in these high-stakes settings.


Available from American Psychologist

Paul R. Sackett, Matthew J. Borneman, and Brian S. Connelly
University of Minnesota, Twin Cities Campus
Keywords: employment testing, admissions testing, selection, validity
As young adults complete high school in the United States, they typically pursue one of three options: continue their education, enter the civilian work force, or join the military. In all three settings, there is a long history of using standardized tests of developed cognitive abilities for selection decisions. In these domains, the tests themselves often are very similar. For example, Frey and Detterman (2004) reported a correlation of .82 between scores on the SAT, widely used for college admissions, and a composite score on the Armed Services Vocational Aptitude Battery. Given the similarities in tests, test-taking populations, and questions that commonly arise about test use in these domains, in this article we examine both educational admissions and personnel selection.
Testing is one aspect of the field of psychology with which virtually the entire public comes into contact. As members of the broader community, psychologists, regardless of their area of specialization, are likely to have contact with family members, friends, and neighbors who are asked to take tests as part of the educational admissions or occupational entry processes. In their role as psychologists, they are likely to be asked to comment about a range of testing issues. We believe that there is much myth and hearsay regarding standardized tests and that it is important for psychologists to be aware of the central findings of the testing literature. Thus, in this article, we summarize key findings about a number of criticisms commonly made about testing. In many cases, these claims are contrary to findings considered well established within the testing research community, and they are generally expressed in contexts outside of the scientific literature, such as the popular press. In other cases, they reflect issues still being investigated and debated within the testing community. We attempt to differentiate claims for which there is general agreement within the testing community from claims that are as yet unresolved. For these as yet unresolved claims, we document specific instances of debate and summarize relevant research in the area. We hope both types of information will be helpful to psychologists in responding to questions about test use. We focus on the following set of assertions commonly made about testing:
Assertion 1: Tests predict badly, if at all. Correlations with commonly used criteria (such as first-year college grades) are small, typically in the .25–.35 range. The squared correlation gives the percentage of criterion variance accounted for by the test; thus a correlation of .30 indicates that the test accounts for less than 10% of the variance in the criterion. Given the small amount of variance accounted for by the tests, tests should not play a significant role in high-stakes decisions.
Assertion 2: Tests do not measure all important determinants of all important criteria. Even if tests do predict to the modest degree outlined above, they predict performance in the short term only (e.g., first year grades, performance in job training) and do not predict criteria in the long term, such as earnings, or job/academic success. There are additional determinants of these criteria other than those measured by tests of developed abilities, such as diligence, persistence, energy, and drive.
Assertion 3: Even if tests have some predictive value, they are valuable only for screening out individuals with low scores. Above a certain threshold, higher scores do not matter; thus, it is inappropriate for colleges or employers to use test scores to differentiate among those above this threshold.
Assertion 4: Tests serve merely as a proxy for wealth and privilege; they reflect socioeconomic status (SES) rather than developed abilities. Any predictive power that tests appear to have disappears once one controls for SES.
Assertion 5: Tests are readily coached; those with knowledge of this fact and the financial resources to pay for coaching programs can substantially increase their scores.
Assertion 6: Tests are biased against members of racial and ethnic minority groups, and sometimes against women as well, as evidenced by the common finding of substantially lower mean scores for minority groups.
Assertion 7: While minority group members obtain lower mean scores on tests, they perform just as well as majority group members once admitted or hired.
Assertion 8: Motivational mechanisms, such as stereotype threat, explain majority–minority group mean differences.

Paul R. Sackett, Matthew J. Borneman, and Brian S. Connelly, Department of Psychology, University of Minnesota, Twin Cities Campus.
Full disclosure of interests: Paul R. Sackett has research grants from and serves on the SAT Psychometric Panel of the College Board and serves on the Research Advisory Committee of Psychological Services, Inc. Matthew J. Borneman is an employee of PreVisor, an employment testing firm. Brian S. Connelly has a student research grant from the College Board.
Correspondence concerning this article should be addressed to Paul R. Sackett, Department of Psychology, University of Minnesota, Twin Cities Campus, Elliott Hall, 75 East River Road, Minneapolis, MN 55455. E-mail: psackett@umn.edu
American Psychologist, May–June 2008, Vol. 63, No. 4, 215–227. DOI: 10.1037/0003-066X.63.4.215
Copyright 2008 by the American Psychological Association 0003-066X/08/$12.00

We respond to each of these assertions, presenting what we view as the most compelling data on each. We focus on meta-analytic syntheses and/or large, nationally representative samples whenever these are available. Individual small-sample studies are prone to features that can distort findings, such as random sampling idiosyncrasies (cf. Hunter & Schmidt, 2004). Consequently, there is no doubt that readers can locate individual studies with findings contrary to large-sample findings. However, we believe that an overview of the findings of large-scale studies and meta-analytic syntheses will give the clearest picture of the cumulative body of knowledge on the set of issues outlined above.
Assertion 1: Tests Predict Badly
Prototypically, admissions tests correlate about .35 with first-year grade point average (GPA), and employment tests correlate about .35 with job training performance and about .25 with performance on the job. One reaction to these findings is to square these correlations to obtain the variance accounted for by the test (.25 accounts for 6.25%; .35 accounts for 12.25%) and to question the appropriateness of giving tests substantial weight in selection or admissions decisions given these small values (e.g., Sternberg, Wagner, Williams, & Horvath, 1995; Vasquez & Jones, 2006).
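The squaring step behind these percentages is simple arithmetic, sketched below for the correlations quoted in the text. The function name is only an illustrative label and is not taken from the article.

```python
def percent_variance_explained(r):
    """Squared correlation, expressed as a percentage of criterion variance."""
    return 100 * r ** 2


# Correlations quoted in the text: .25, .30, and .35.
for r in (0.25, 0.30, 0.35):
    print(f"r = {r:.2f}: {percent_variance_explained(r):.2f}% of criterion variance")
```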
One response to this reaction is to note that even if the values above were accurate (and we make the case below that they are, in fact, substantial underestimates), correlations of such magnitude are of more value than critics recognize. As long ago as 1928, Hull criticized the small percentage of variance accounted for by commonly used tests. In response, a number of scholars developed alternate metrics designed to be more readily interpretable than “percentage of variance accounted for” (Lawshe, Bolda, & Auclair, 1958; Taylor & Russell, 1939). Lawshe et al. (1958) tabled the percentage of test takers in each test score quintile (e.g., top 20%, next 20%, etc.) who met a set standard of success (e.g., being an above-average performer on the job or in school). A test correlating .30 with performance can be expected to result in 67% of those in the top test quintile being above-average performers (i.e., 2 to 1 odds of success) and 33% of those in the bottom quintile being above-average performers (i.e., 1 to 2 odds of success). Converting correlations to differences in odds of success results both in a readily interpretable metric and in a positive picture of the value of a test that “only” accounts for 9% of the variance in performance. Subsequent researchers have developed more elaborate models of test utility (e.g., Boudreau & Rynes, 1985; Brogden, 1946, 1949; Cronbach & Gleser, 1965; Murphy, 1986) that make similar points about the substantial value of tests with validities of the magnitude commonly observed. In short, there is a long history of expressing the value of a test in a metric more readily interpretable than percentage of variance accounted for.
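The quintile figures quoted above can be approximated with a short calculation. The sketch below is not the procedure of Lawshe et al. (1958). It rests on the assumption that test scores and performance follow a standard bivariate normal distribution with correlation r, and that "success" means performance above the mean. The function name and the numerical integration settings are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1


def success_rate_in_band(r, lo_pct, hi_pct, steps=2000):
    """P(performance above its mean | test score between two percentiles).

    Assumption: test score X and performance Y are standard bivariate
    normal with correlation r. Then Y given X = x is normal with mean r*x
    and SD sqrt(1 - r^2), so P(Y > 0 | X = x) = Phi(r*x / sqrt(1 - r^2)).
    The band average is taken with a midpoint rule over percentiles.
    """
    scale = (1 - r * r) ** 0.5
    total = 0.0
    for i in range(steps):
        p = lo_pct + (hi_pct - lo_pct) * (i + 0.5) / steps
        x = STD_NORMAL.inv_cdf(p)  # test score at percentile p
        total += STD_NORMAL.cdf(r * x / scale)
    return total / steps


r = 0.30
top = success_rate_in_band(r, 0.80, 1.00)
bottom = success_rate_in_band(r, 0.00, 0.20)
print(f"top quintile:    {top:.3f} above average")
print(f"bottom quintile: {bottom:.3f} above average")
print(f"odds, top quintile: {top / (1 - top):.2f} to 1")
```

Under these assumptions the two rates should land close to the 67% and 33% quoted in the text. A different success standard or a non-normal score distribution would shift them.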
Another response to the practice of squaring and derogating observed validity coefficients is to note that observed validity coefficients are typically substantial underestimates of the operational validity of a test. There are two key truths, well-known to test researchers but often either not recognized or rejected by test critics: first, that studies of highly select samples underestimate validity by restricting the range of observed scores and, second, that unreliable criterion measures result in an underestimation of the true operational validity. We discuss each of these in turn.
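Both effects can be illustrated with standard psychometric formulas. The following is a minimal sketch and not the authors' analysis. It assumes normally distributed test scores, top-down selection on the test alone, a linear and homoscedastic relation between test and criterion, and classical test theory for unreliability. The applicant-pool correlation of .50 and the criterion reliability of .64 are hypothetical values chosen for illustration, and the function names are likewise invented here.

```python
from statistics import NormalDist


def restricted_sd_ratio(cut_percentile):
    """SD of test scores among those selected, relative to the full pool.

    Assumption: scores are normal and everyone above the cut is selected.
    Uses the mean and variance of a normal truncated from below.
    """
    nd = NormalDist()
    cut = nd.inv_cdf(cut_percentile)
    selected_mean = nd.pdf(cut) / (1 - cut_percentile)
    variance = 1 + cut * selected_mean - selected_mean ** 2
    return variance ** 0.5


def restricted_correlation(r, u):
    """Correlation expected in the selected group under direct selection.

    r is the applicant-pool correlation; u is the restricted-to-unrestricted
    SD ratio of test scores.
    """
    return r * u / (1 - r * r + r * r * u * u) ** 0.5


def attenuated_correlation(r, criterion_reliability):
    """Observed correlation when only the criterion is measured with error."""
    return r * criterion_reliability ** 0.5


u = restricted_sd_ratio(0.50)           # hiring above the 50th percentile
r_pool = 0.50                           # hypothetical applicant-pool validity
r_selected = restricted_correlation(r_pool, u)
r_observed = attenuated_correlation(r_selected, 0.64)  # hypothetical reliability
print(f"SD ratio u = {u:.3f}")
print(f"validity among those selected = {r_selected:.3f}")
print(f"validity also attenuated by criterion unreliability = {r_observed:.3f}")
```

With these hypothetical inputs, the correlation seen among those selected is noticeably smaller than the applicant-pool value, and criterion unreliability shrinks it further.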
Range Restriction Leads to Underestimates of Validity
Consider an employer hiring only individuals with test scores above the 50th percentile. The sample of selected individuals will have a smaller standard deviation of test scores than the standard deviation in the full applicant pool,