Data clustering to optimise the representativity of observational data in air quality data assimilation: A case study with EURAD-IM (version 5.9.1 DA)

Alexander Hermanns; Anne Caroline Lange; Julia Kowalski; Hendrik Fuchs; Philipp Franke

Journal Article, Open Access


Geoscientific Model Development (2025) 18(23) 9417-9432

DOI: 10.5194/gmd-18-9417-2025


Abstract

In air quality analysis, data assimilation is commonly used to integrate observational information on the atmospheric state into the model. However, the analysis depends largely on the data available to the assimilation system. To obtain an accurate analysis of the true state of the atmosphere, the representativity of the data used is therefore a fundamental requirement. Here, a method is presented that derives a representative split of the ground-based monitoring network data and depends only on the characteristics of the observation data. The core of the method is a clustering algorithm that subdivides the data into subsets. Two clustering algorithms, k-means and k-means soft constraint, are tested and applied to air pollutant observations in Europe. The clusters are derived solely from intrinsic properties of the observations (such as geographic location and averaged concentrations). The resulting clusters reliably distinguish common features of the observational data, e.g. the mean and variance of averaged air pollutant concentrations. Representativity of the observational data in the assimilation and validation subsets is ensured by sampling each cluster individually. The method is tested using the assimilation system of the chemistry transport model EURAD-IM (EURopean Air pollution Dispersion - Inverse Model) and evaluated for data from four months in 2016. The results show a significant improvement in the representativity of the analysis, quantified by the difference between the root mean square error of the analysis with respect to the assimilation dataset and that with respect to the validation dataset. Compared to an operational configuration, the relative representativity measure improves most for CO (16 %), followed by NO2 (4 %) and O3 (1 %). The measure deteriorates for SO2 (-5 %), PM10 (-2 %) and PM2.5 (-5 %), although these differences do not lead to significant deviations in absolute values given the overall error, and the improvement for CO outweighs the changes in the other species.
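The workflow summarised in the abstract (cluster the stations, sample each cluster separately into assimilation and validation subsets, then compare the analysis error on both subsets) can be illustrated with a minimal Python sketch. This is not the EURAD-IM implementation. All function names are hypothetical, the feature vector (longitude, latitude, mean concentration) is assumed to be scaled already, the 25 % validation fraction is an arbitrary choice, and only plain k-means is shown, not the soft-constraint variant.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Sample every cluster separately into assimilation/validation indices."""
    rng = random.Random(seed)
    assimilation, validation = [], []
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        rng.shuffle(idx)
        n_val = round(validation_fraction * len(idx))
        validation += idx[:n_val]
        assimilation += idx[n_val:]
    return sorted(assimilation), sorted(validation)


def rmse(analysis, observed):
    """Root mean square error between analysis values and observations."""
    squared = [(a - o) ** 2 for a, o in zip(analysis, observed)]
    return math.sqrt(sum(squared) / len(squared))


def take(values, idx):
    """Select the entries of `values` at the given station indices."""
    return [values[i] for i in idx]


def rmse_gap(analysis, observed, assimilation, validation):
    """RMSE on validation stations minus RMSE on assimilated stations."""
    return (rmse(take(analysis, validation), take(observed, validation))
            - rmse(take(analysis, assimilation), take(observed, assimilation)))


# Usage with made-up stations: (scaled lon, scaled lat, scaled mean conc.)
stations = [(0.1 * i, 0.0, 0.0) for i in range(8)] + \
           [(10.0 + 0.1 * i, 10.0, 10.0) for i in range(8)]
station_labels = kmeans(stations, k=2)
assim_idx, val_idx = stratified_split(station_labels)
```

In this sketch a smaller gap means the analysis error on the withheld stations is closer to the error on the assimilated ones, which is the sense in which the abstract speaks of representativity. How the paper normalises this difference into its relative measure is not stated in the abstract and is not reproduced here.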

Cite

Hermanns, A., Lange, A. C., Kowalski, J., Fuchs, H., & Franke, P. (2025). Data clustering to optimise the representativity of observational data in air quality data assimilation: A case study with EURAD-IM (version 5.9.1 DA). Geoscientific Model Development, 18(23), 9417–9432. https://doi.org/10.5194/gmd-18-9417-2025
