The quality of data is often measured by counting artifacts. While this procedure is simple and applies to many types of artifacts, such as errors, inconsistencies and missing values, counts do not differentiate between different distributions of data artifacts. A possible solution is to add a randomness measure that indicates how randomly data artifacts are distributed. It has been proposed to calculate randomness by means of the Lempel-Ziv complexity algorithm, but this approach comes with some drawbacks. Most importantly, the Lempel-Ziv approach assumes an implicit order among data objects, and the measured randomness depends on this order. To overcome this problem, a new method is proposed that measures randomness as proportional to the average number of bits needed to compress, using unary coding, the bit matrix that marks the artifacts in a database relation. It is shown that this method has several interesting properties that align the proposed measure with the intuitive perception of randomness.
Boeckling, T., Bronselaer, A., & De Tré, G. (2018). Randomness of data quality artifacts. In Communications in Computer and Information Science (Vol. 855, pp. 529–540). Springer Verlag. https://doi.org/10.1007/978-3-319-91479-4_44