Clustering is arguably the most important primitive for data mining, finding use as a subroutine in many higher-order algorithms. In recent years, the community has redirected its attention from the batch case to the online case. This need to support online clustering is engendered by the proliferation of cheap, ubiquitous sensors that continuously monitor various aspects of our world, from heartbeats as we exercise to the number of mosquitoes visiting a well in a village in Ethiopia. In this work, we argue that current online clustering solutions leave room for improvement. To some degree, they all have at least one of the following shortcomings: they are parameter-laden, defined only for certain distance functions, sensitive to outliers, and/or approximate. This last point requires clarification, since in some sense almost all clustering algorithms are approximate. For example, k-means in general only approximately optimizes its objective function. However, streaming versions of the k-means algorithm further approximate this approximation, potentially leading to very poor solutions. In this work, we introduce an algorithm that mitigates these flaws. It is parameter-lite, defined for any distance function, insensitive to outliers, and produces the same output as the batch version of the algorithm. We demonstrate the utility and effectiveness of our ideas with case studies in entomology, cardiology, and biological audio processing.
CITATION STYLE
Ulanova, L., Begum, N., Shokoohi-Yekta, M., & Keogh, E. (2016). Clustering in the face of fast changing streams. In 16th SIAM International Conference on Data Mining 2016, SDM 2016 (pp. 1–9). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611974348.1