Abstract
Online clustering algorithms play a critical role in data science: compared with traditional clustering methods, they reduce time, memory usage and computational complexity while maintaining high performance. This tutorial serves, first, as a survey of online machine learning and, in particular, of data stream clustering methods. It presents state-of-the-art algorithms and the associated core research threads, grouped into categories based on distance, density grids and hidden statistical models. It also examines clustering validity indices in depth. These indices are an important part of the clustering process, yet they are usually neglected or replaced with classification metrics, which leads to misleading interpretations of the final results. This introduction is then put into context with River, a go-to Python library created by merging Creme and scikit-multiflow. River is also the first open-source project to include an online clustering module that facilitates reproducibility and allows direct further improvements. Building on this, we propose methods for clustering configuration, applications and benchmark settings, using real-world problems and datasets.
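To make the distance-based category concrete, the following is a minimal sketch of sequential (MacQueen-style) k-means, which updates its centers one observation at a time and never stores the stream. It is an illustration under stated assumptions, not River's implementation: the class `OnlineKMeans`, its attributes and the toy stream are hypothetical, and only the method names `learn_one` and `predict_one` are chosen to mirror River's naming convention.

```python
import math


class OnlineKMeans:
    """Hypothetical sequential k-means: one pass, one point at a time."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centers = []  # one list of floats per cluster
        self.counts = []   # number of points absorbed by each cluster

    def _nearest(self, x):
        # Index of the center closest to x in Euclidean distance.
        return min(range(len(self.centers)),
                   key=lambda i: math.dist(self.centers[i], x))

    def learn_one(self, x):
        # Assumption: the first n_clusters points seed the centers.
        if len(self.centers) < self.n_clusters:
            self.centers.append(list(x))
            self.counts.append(1)
            return
        i = self._nearest(x)
        self.counts[i] += 1
        eta = 1.0 / self.counts[i]  # step size shrinks as the cluster grows
        # Move the winning center toward x; this keeps it at the running mean.
        self.centers[i] = [c + eta * (v - c)
                           for c, v in zip(self.centers[i], x)]

    def predict_one(self, x):
        return self._nearest(x)


# Toy stream (made up for illustration): two well-separated groups.
stream = [(0.0, 0.1), (10.0, 9.9), (0.2, -0.1),
          (9.8, 10.2), (-0.1, 0.0), (10.1, 10.0)]

model = OnlineKMeans(n_clusters=2)
for x in stream:
    model.learn_one(x)

label_a = model.predict_one((0.0, 0.0))
label_b = model.predict_one((10.0, 10.0))
```

Memory use is fixed at `n_clusters` centers and counters regardless of stream length, which is the property that separates this family from batch clustering. Seeding from the first points is a simplification; a stream whose first observations all fall in one group would need a more careful initialization.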
Montiel, J., Ngo, H. A., Le-Nguyen, M. H., & Bifet, A. (2022). Online Clustering: Algorithms, Evaluation, Metrics, Applications and Benchmarking. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 4808–4809). Association for Computing Machinery. https://doi.org/10.1145/3534678.3542600