Abstract
There is no shortage of outlier detection (OD) algorithms in the literature, yet the vast majority are designed for a single machine. As more and more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, either to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.
Cite
Zhang, S., Ursekar, V., & Akoglu, L. (2022). Sparx: Distributed Outlier Detection at Scale. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 4530–4540). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539076