In the quality control of an assessment with multiple forms, one goal is to maintain a stable scale over time. Variability and seasonality in examinee samples and test conditions can cause variation in IRT linking and equating procedures and can violate the “sampling exchangeability” assumed in the Draper–Lindley–de Finetti (DLD) measurement validity framework. As an initial exploration of optimal design in linking, we sought an improved sampling design for invariant Stocking–Lord test characteristic curve (TCC) linking across testing seasons. We applied statistical weighting techniques, such as raking and poststratification, to yield a weighted sample distribution that is consistent with the target population distribution. To assess the effects of weighting on linking, we first selected multiple subsamples from an original sample and then compared the linking parameters from the subsamples with those from the original sample. The results showed that the linking parameters from the weighted subsamples yielded smaller mean square errors (MSE) than those from the unweighted subsamples. The techniques developed can be applied to (1) assessments such as GRE® and TOEFL®, which show variability and seasonality among multiple forms, and (2) assessments such as state assessments, in which linking decisions are based on small initial data.
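As a rough illustration of the raking step named above, the following is a minimal sketch of iterative proportional fitting in Python. It is not the authors' implementation: the function name `rake`, the two background variables (`gender`, `region`), the sample size, and the target proportions are all hypothetical and chosen only to show how weights can be adjusted until the weighted sample margins agree with target population margins.

```python
import numpy as np

def rake(categories, targets, n_iter=100, tol=1e-10):
    """Rake unit weights so that weighted margins match target proportions.

    categories: list of integer-coded arrays, one per variable, each of length n.
    targets: list of arrays of target population proportions (each sums to 1).
    Assumes every level of every variable occurs at least once in the sample.
    """
    n = len(categories[0])
    w = np.full(n, 1.0 / n)  # start from equal weights that sum to 1
    for _ in range(n_iter):
        max_gap = 0.0
        for cat, target in zip(categories, targets):
            # current weighted proportion at each level of this variable
            current = np.bincount(cat, weights=w, minlength=len(target))
            max_gap = max(max_gap, np.abs(current - target).max())
            # scale each unit's weight by target/current for its level
            w = w * (target / current)[cat]
        if max_gap < tol:
            break
    return w

# Hypothetical subsample whose margins differ from the target population.
rng = np.random.default_rng(0)
gender = rng.choice(2, size=500, p=[0.65, 0.35])
region = rng.choice(3, size=500, p=[0.5, 0.3, 0.2])

# Hypothetical target population margins.
target_gender = np.array([0.5, 0.5])
target_region = np.array([0.3, 0.3, 0.4])

w = rake([gender, region], [target_gender, target_region])
weighted_gender = np.bincount(gender, weights=w, minlength=2)
weighted_region = np.bincount(region, weights=w, minlength=3)
print(weighted_gender, weighted_region)
```

In a linking study of the kind described, weights like these would then enter the item calibration or the linking computation for each subsample; that step is not shown here.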
Qian, J., von Davier, A. A., & Jiang, Y. (2013). Achieving a stable scale for an assessment with multiple forms: Weighting test samples in IRT linking. In Springer Proceedings in Mathematics and Statistics (Vol. 66, pp. 171–185). Springer New York LLC. https://doi.org/10.1007/978-1-4614-9348-8_11