Measures of reliability in sports medicine and science.
- PubMed: 10907753
Abstract
Reliability refers to the reproducibility of values of a test, assay or other measurement in repeated trials on the same individuals. Better reliability implies better precision of single measurements and better tracking of changes in measurements in research or practical settings. The main measures of reliability are within-subject random variation, systematic change in the mean, and retest correlation. A simple, adaptable form of within-subject variation is the typical (standard) error of measurement: the standard deviation of an individual's repeated measurements. For many measurements in sports medicine and science, the typical error is best expressed as a coefficient of variation (percentage of the mean). A biased, more limited form of within-subject variation is the limits of agreement: the 95% likely range of change of an individual's measurements between 2 trials. Systematic changes in the mean of a measure between consecutive trials represent such effects as learning, motivation or fatigue; these changes need to be eliminated from estimates of within-subject variation. Retest correlation is difficult to interpret, mainly because its value is sensitive to the heterogeneity of the sample of participants. Uses of reliability include decision-making when monitoring individuals, comparison of tests or equipment, estimation of sample size in experiments and estimation of the magnitude of individual differences in the response to a treatment. Reasonable precision for estimates of reliability requires approximately 50 study participants and at least 3 trials. Studies aimed at assessing variation in reliability between tests or equipment require complex designs and analyses that researchers seldom perform correctly. A wider understanding of reliability and adoption of the typical error as the standard measure of reliability would improve the assessment of tests and equipment in our disciplines.
Will G. Hopkins
Department of Physiology, School of Medical Sciences and School of Physical Education,
University of Otago, Dunedin, New Zealand
CURRENT OPINION
Sports Med 2000 Jul; 30 (1): 1-15
0112-1642/00/0007-0001/$20.00/0
Adis International Limited. All rights reserved.
Measurement error makes the observed value of a measure differ from the true value. Anyone who takes or uses measurements should therefore have some understanding of measurement error. In my experience, the 2 most important aspects of measurement error are concurrent validity and retest reliability. Concurrent validity concerns the agreement between the observed value and the true or criterion value of a measure. Retest reliability concerns the reproducibility of the observed value when the measurement is repeated. Analysis of validity is complex, owing to the inevitable presence of error in the criterion value. I have therefore limited this article to the measurement errors that are accessible in reliability studies. These errors have a major impact on our attempts to measure changes be-
cern for anyone interested in a single measurement.
Studying the reliability of a measure is a straightforward matter of repeating the measurement a reasonable number of times on a reasonable number of individuals. The most important measurement error to come out of such a study is the random error or noise in the measure: the smaller the error, the better the measure. How best to represent this error and several other measures of reliability is a matter of debate. Atkinson and Nevill[1] contributed a useful point of view in their review of reliability in this journal recently, but I have a different perspective on the relative merits of the various measures of reliability. In the present article I justify my choice of the most appropriate measures. I also explore the uses of reliability and deal with the design and analysis of reliability studies.
My approach to reliability is appropriate for most variables that have numbers as values (e.g. 71.3 kg for body mass). Reliability of measures that have labels as values (e.g. 'female' for sex) is beyond the scope of the present article.
1. Measures of Reliability
When we speak of reliability, we refer to the repeatability or reproducibility of a measure or variable. I will sometimes follow the popular but inaccurate convention of referring not to the reliability of a measure but to the reliability of the test, assay or instrument that provided the measure. I will also use the word 'trials' to mean repeated administrations of a test or assay.
Researchers quantify reliability in a variety of ways. I deal here with what I believe are the only 3 important types of measure: within-subject variation, change in the mean, and retest correlation.[2]
1.1 Within-Subject Variation
Within-subject variation is the most important type of reliability measure for researchers, because it affects the precision of estimates of change in the variable of an experimental study. It is also the most important type of reliability measure for coaches, physicians, scientists and other professionals using tests to monitor the performance or health of their clients. In these situations, the smaller the within-subject variation, the easier it will be to notice or measure a change in performance or health.
An easy way to understand the meaning of within-subject variation is to regard it as the random variation in a measure when one individual is tested many times. For example, if the values for many trials of one individual are 71, 76, 74, 79, 79 and 76, there is a random variation of a few units between trials. A statistic that captures this notion of random variability of a single individual's values on repeated testing is the standard deviation of the individual's values. This within-subject standard deviation is also known as the standard error of measurement. In plain language, it represents the typical error in a measurement, and that is how I will refer to it hereafter.
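As a minimal sketch of the definition above, the typical error for the six example values can be computed as the sample standard deviation of one individual's trials. The function names `typical_error` and `typical_error_cv` are illustrative and do not come from the article. The use of the n - 1 denominator, and of 100 × SD / mean for the coefficient of variation (the "percentage of the mean" mentioned in the abstract), are assumptions about the simplest reasonable calculation.

```python
import statistics


def typical_error(values):
    """Within-subject standard deviation of one individual's repeated trials.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    return statistics.stdev(values)


def typical_error_cv(values):
    """Typical error expressed as a percentage of the individual's mean.

    Assumption: simple 100 * SD / mean form of the coefficient of variation.
    """
    return 100 * statistics.stdev(values) / statistics.mean(values)


# The example values for many trials of one individual, as given in the text.
trials = [71, 76, 74, 79, 79, 76]

print(f"typical error: {typical_error(trials):.2f}")   # about 3.06 units
print(f"as a CV: {typical_error_cv(trials):.1f}%")     # about 4.0% of the mean
```

The result of roughly 3 units agrees with the "random variation of a few units between trials" described above.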
The variation represented by typical error comes from several sources. The main source is usually biological. For example, an individual's maximum power output changes between trials because of changes in mental or physical state. Equipment may also contribute noise to the measurements, although in simple reliability studies this technological source of error is often unavoidably lumped in with the biological error. When the same individual is retested on different equipment or by different operators, additional error due to differences in the calibration or functioning of the equipment or in the ability of the operators can surface. An analogous situation occurs when different judges rate the same athlete in different locations. I will deal with these and other complex examples of reliability in section 3.3.
In most situations where reliability is an issue, we are interested in the simple question of reproducibility of an individual's values obtained on the same piece of equipment by the same operator. To estimate typical error in these situations, we usually use many participants and a few trials rather than 1 participant and many trials. For example, for 5 participants in 2 trials, with the values shown in table I, the typical error is 2.9. We can still interpret the typical error of 2.9 as the variation we would


