Improving the Output Quality of Official Statistics Based on Machine Learning Algorithms

Abstract

National statistical institutes currently investigate how to improve the output quality of official statistics based on machine learning algorithms. A key issue is concept drift, that is, when the joint distribution of independent variables and a dependent (categorical) variable changes over time. Under concept drift, a statistical model requires regular updating to prevent it from becoming biased. However, updating a model requires additional data, which are not always available. An alternative is to reduce the bias by means of bias correction methods. In this article, we focus on estimating the proportion (base rate) of a category of interest and we compare two popular bias correction methods: the misclassification estimator and the calibration estimator. For prior probability shift (a specific type of concept drift), we investigate the two methods analytically as well as numerically. Our analytical results are expressions for the bias and variance of both methods. As a numerical result, we present a decision boundary for the relative performance of the two methods. Our results provide a better understanding of the effect of prior probability shift on output quality. Consequently, we can recommend a novel approach to using machine learning algorithms in the context of official statistics.
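To make the comparison concrete, the sketch below simulates the binary setting the abstract describes: a classifier's error rates and calibration probabilities are estimated on a labelled test set, and the base rate of new data is then estimated after a prior probability shift. The simulation parameters (error rates, base rates, sample sizes) are illustrative assumptions, not values from the paper; the two estimators follow their standard textbook forms.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(n, base_rate, sens, spec, rng):
    """Draw true labels at base_rate and noisy predictions with the
    given sensitivity and specificity (assumed simulation setup)."""
    y = rng.random(n) < base_rate
    noise = rng.random(n)
    y_hat = np.where(y, noise < sens, noise < 1 - spec)
    return y, y_hat.astype(bool)

sens_true, spec_true = 0.90, 0.80  # illustrative classifier quality

# Labelled test set, drawn at the training-time base rate of 0.50
y_test, yhat_test = simulate(50_000, 0.50, sens_true, spec_true, rng)

# Quantities estimated on the test set
sens_hat = yhat_test[y_test].mean()      # P(pred 1 | true 1)
spec_hat = (~yhat_test[~y_test]).mean()  # P(pred 0 | true 0)
p1_pred1 = y_test[yhat_test].mean()      # P(true 1 | pred 1)
p1_pred0 = y_test[~yhat_test].mean()     # P(true 1 | pred 0)

# Production data under prior probability shift: base rate moved to 0.30
shifted_rate = 0.30
_, yhat_prod = simulate(200_000, shifted_rate, sens_true, spec_true, rng)
alpha = yhat_prod.mean()  # naive estimate: share predicted positive

# Misclassification estimator: invert the estimated confusion probabilities
p_mis = (alpha - (1 - spec_hat)) / (sens_hat - (1 - spec_hat))

# Calibration estimator: average the test-set calibration probabilities
# over the production predictions
p_cal = alpha * p1_pred1 + (1 - alpha) * p1_pred0

print(f"naive={alpha:.3f}  misclassification={p_mis:.3f}  "
      f"calibration={p_cal:.3f}  (true shifted rate={shifted_rate})")
```

In this setup the misclassification estimator recovers the shifted base rate, while the calibration estimator stays anchored to the test-set calibration probabilities and remains biased, which is the behaviour the article analyzes when deriving the decision boundary between the two methods.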

Citation (APA)
Meertens, Q. A., Diks, C. G. H., Van Den Herik, H. J., & Takes, F. W. (2022). Improving the Output Quality of Official Statistics Based on Machine Learning Algorithms. Journal of Official Statistics, 38(2), 485–508. https://doi.org/10.2478/jos-2022-0023
