Abstract
dataquieR is an R package to conduct data quality assessments in data collections designed for research. It makes strong use of metadata that specify the requirements of the study data. Spreadsheet tables can be used to collect this information in a standardized manner. dataquieR starts by checking, during integrity analyses, the formal compliance of study data with expectations defined in the metadata, such as the data type. Depending on available metadata, further data quality assessments cover the dimensions of completeness, consistency, and accuracy as proposed by the framework of Schmidt et al. (2020). Three dataquieR functions investigate the completeness of data within and across observational units. Consistency-related analysis comprises two aspects. First, depending on the data type, the compliance of data elements with either user-defined limits or expected value lists is investigated. Second, contradictions between the values of two data elements can be identified by using one of eleven logical comparisons, e.g., if systolic blood pressure is lower than diastolic blood pressure whereas the opposite is expected. Eight dataquieR functions support accuracy-related analyses by targeting unexpected distributions of single or multiple data elements. Particular focus is placed on the influence of observers, examiners, and devices on the measurement process.
Statement of Need
Various data quality concepts have been proposed to evaluate data's "fitness for use", with differing definitions of terms and focus areas (Cai & Zhu, 2015). To make the differences underlying these approaches comprehensible, Keller et al. (2017) stressed the importance of differentiating between (a) designed data collections, (b) administrative data, and (c) opportunity data. Kahn et al. (2016) had already proposed a concept of data quality tailored to electronic health records (EHR) data. Schmidt et al. (2020) have recently introduced a framework that specifically addresses the requirements of designed research data collections. Data collected for research purposes differ substantially from EHR data, as the researchers are involved in the design, the conduct, and the control of the measurement process. Further, enriched metadata, describing the collected data elements beyond data types and labels, are commonly available, as is process information, i.e., the circumstances under which data have been generated (Richter et al., 2019). dataquieR was developed to make specific use of metadata and process information for data quality assessments in designed data collections, and to complement a data quality framework for research data collections.
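The two consistency aspects described in the abstract, limit checks and contradiction checks, can be illustrated with a minimal sketch. It is written in plain Python for illustration only and does not use or reproduce the dataquieR API (dataquieR itself is an R package). The variable names `sbp` and `dbp`, the limit values, and the helper functions `limit_violations` and `contradictions` are all hypothetical assumptions.

```python
# Hypothetical metadata: admissible limits per data element.
# The values are illustrative assumptions, not taken from the paper.
LIMITS = {"sbp": (60, 300), "dbp": (30, 200)}


def limit_violations(records, limits):
    """Return (row index, variable, value) for every value outside its limits."""
    found = []
    for i, rec in enumerate(records):
        for var, (low, high) in limits.items():
            value = rec.get(var)
            # Missing values are a completeness issue, so they are skipped here.
            if value is not None and not (low <= value <= high):
                found.append((i, var, value))
    return found


def contradictions(records, left, right):
    """Return row indices where `left` < `right`, although left >= right is expected."""
    return [
        i
        for i, rec in enumerate(records)
        if rec.get(left) is not None
        and rec.get(right) is not None
        and rec[left] < rec[right]
    ]


# Toy study data: systolic (sbp) and diastolic (dbp) blood pressure.
records = [
    {"sbp": 120, "dbp": 80},   # plausible
    {"sbp": 70, "dbp": 95},    # contradiction: systolic below diastolic
    {"sbp": 400, "dbp": 90},   # systolic outside the assumed limits
    {"sbp": None, "dbp": 85},  # missing value, ignored by both checks
]

print(limit_violations(records, LIMITS))
print(contradictions(records, "sbp", "dbp"))
```

Keeping the limits in a separate metadata structure, rather than hard-coding them in the checks, mirrors the metadata-driven approach the abstract describes, although the actual package organizes its metadata differently.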
Richter, A., Schmidt, C., Krüger, M., & Struckmann, S. (2021). dataquieR: assessment of data quality in epidemiological research. Journal of Open Source Software, 6(61), 3093. https://doi.org/10.21105/joss.03093