Correlation by compression

Kailash Budhathoki; Jilles Vreeken

Conference Proceedings, Open Access

Proceedings of the 17th SIAM International Conference on Data Mining, SDM 2017 (2017) 525-533

DOI: 10.1137/1.9781611974973.59

Abstract

Discovering correlated variables is one of the core problems in data analysis. Many measures of correlation have been proposed, yet correlation is surprisingly ill-defined in general. That is, most, if not all, measures make very strong assumptions about the data distribution or the type of dependency they can detect. In this work, we provide a general theory of correlation, without making any such assumptions. Simply put, we propose correlation by compression. To this end, we propose two correlation measures based on solid information-theoretic foundations, namely Kolmogorov complexity. The proposed measures possess properties desirable for any sensible correlation measure. However, Kolmogorov complexity is not computable, and hence we propose practical, computable instantiations based on the Minimum Description Length (MDL) principle. In practice, we can apply the proposed measures to any type of data by instantiating them with any lossless real-world compressor that rewards pairwise dependencies. Extensive experiments show that the correlation measures work well in practice, have high statistical power, and find meaningful correlations on binary data, while they are easily extensible to other data types.
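
The general idea of scoring dependence by compression can be illustrated with a minimal sketch. This is not either of the two measures proposed in the paper, whose definitions are not given in the abstract. It is a generic compression-gain score in the spirit of the normalized compression distance, with zlib standing in as an off-the-shelf lossless compressor. The function names `clen` and `compression_gain`, the normalisation by the smaller compressed size, and the synthetic byte sequences are all assumptions made for illustration.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed size in bytes, used as a crude stand-in for description length."""
    return len(zlib.compress(data, 9))


def compression_gain(x: bytes, y: bytes) -> float:
    """Bytes saved by compressing x and y together rather than separately,
    relative to the smaller of the two individual compressed sizes.
    (Hypothetical score for illustration, not the paper's measure.)"""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cx + cy - cxy) / min(cx, cy)


rng = random.Random(0)  # fixed seed so the example is reproducible
n = 2000

# x: incompressible random bytes.
x = bytes(rng.getrandbits(8) for _ in range(n))

# y_dep: a noisy copy of x (about 5% of positions replaced), so it depends on x.
y_dep = bytes(rng.getrandbits(8) if rng.random() < 0.05 else b for b in x)

# y_ind: fresh random bytes, independent of x.
y_ind = bytes(rng.getrandbits(8) for _ in range(n))

gain_dep = compression_gain(x, y_dep)
gain_ind = compression_gain(x, y_ind)

print(f"gain(x, noisy copy of x) = {gain_dep:.3f}")
print(f"gain(x, independent y)   = {gain_ind:.3f}")
```

The dependent pair should score clearly higher than the independent pair, because the compressor can encode most of the noisy copy as back-references into `x`. Note that zlib only finds repeats within a 32 KB window, so a sketch like this is meaningful only for short sequences; a compressor chosen to reward pairwise dependencies, as the abstract requires, would be needed in practice.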

Citation (APA)

Budhathoki, K., & Vreeken, J. (2017). Correlation by compression. In Proceedings of the 17th SIAM International Conference on Data Mining, SDM 2017 (pp. 525–533). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611974973.59
