The quality of data is often measured by counting artifacts. While this procedure is simple and applies to many types of artifacts, such as errors, inconsistencies and missing values, counts do not differentiate between different distributions of data artifacts. A possible solution is to add a randomness measure that indicates how randomly data artifacts are distributed. It has been proposed to calculate randomness by means of the Lempel-Ziv complexity algorithm, but this approach comes with some drawbacks. Most importantly, the Lempel-Ziv approach assumes an implicit order among data objects, and the measured randomness depends on this order. To overcome this problem, a new method is proposed that measures randomness as proportional to the average number of bits needed to compress, using unary coding, the bit matrix that marks the artifacts in a database relation. It is shown that this method has several interesting properties that align the proposed measure with the intuitive perception of randomness.
Boeckling, T., Bronselaer, A., & De Tré, G. (2018). Randomness of data quality artifacts. In Communications in Computer and Information Science (Vol. 855, pp. 529–540). Springer Verlag. https://doi.org/10.1007/978-3-319-91479-4_44