Discovering correlated variables is one of the core problems in data analysis. Many measures of correlation have been proposed, yet the notion itself remains surprisingly ill-defined in general. That is, most, if not all, measures make very strong assumptions about the data distribution or the type of dependency they can detect. In this work, we provide a general theory of correlation that makes no such assumptions. Simply put, we propose correlation by compression. To this end, we propose two correlation measures built on solid information-theoretic foundations, namely Kolmogorov complexity. The proposed measures possess properties desirable for any sensible correlation measure. However, Kolmogorov complexity is not computable, and hence we propose practical, computable instantiations based on the Minimum Description Length (MDL) principle. In practice, the proposed measures can be applied to any type of data by instantiating them with any lossless real-world compressor that rewards pairwise dependencies. Extensive experiments show that the correlation measures work well in practice, have high statistical power, and find meaningful correlations on binary data, while being easily extensible to other data types.
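To make the idea of correlation by compression concrete, the following minimal sketch scores the dependence between two byte strings by how much shorter they compress together than separately. It is an illustration only: the function names `compressed_size` and `compression_gain` are hypothetical, the normalised gain is a generic compression-based score rather than either of the two measures proposed in the paper, and zlib is used merely as a convenient off-the-shelf lossless compressor standing in for an MDL-based one.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (a crude proxy for description length)."""
    return len(zlib.compress(data, 9))


def compression_gain(x: bytes, y: bytes) -> float:
    """Fraction of bytes saved by compressing x and y jointly instead of separately.

    Hypothetical illustrative score, not the paper's measure: values near 0
    suggest the compressor found no shared structure; larger values suggest
    that x helps to describe y.
    """
    separate = compressed_size(x) + compressed_size(y)
    joint = compressed_size(x + y)
    return (separate - joint) / separate


# Toy data (assumed for illustration): y duplicates x, z is independent noise.
rng = random.Random(0)
x = bytes(rng.getrandbits(8) for _ in range(2000))
y = x
z = bytes(rng.getrandbits(8) for _ in range(2000))

dependent_gain = compression_gain(x, y)
independent_gain = compression_gain(x, z)
print(f"dependent pair:   {dependent_gain:.3f}")
print(f"independent pair: {independent_gain:.3f}")
```

On this toy data the duplicated pair should score far higher than the independent pair, because the compressor can encode the second string as references to the first. A general-purpose compressor such as zlib only sees dependencies within its 32 KB window, which is one reason a compressor designed to reward pairwise dependencies is preferable in practice.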
Budhathoki, K., & Vreeken, J. (2017). Correlation by compression. In Proceedings of the 17th SIAM International Conference on Data Mining, SDM 2017 (pp. 525–533). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611974973.59