Sparx: Distributed Outlier Detection at Scale

Sean Zhang; Varun Ursekar; Leman Akoglu

Conference Proceedings, Open Access

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) 4530-4540

DOI: 10.1145/3534678.3539076

Abstract

There is no shortage of outlier detection (OD) algorithms in the literature, yet the vast majority of them are designed for a single machine. As more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: We design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, either to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.

Cite

Zhang, S., Ursekar, V., & Akoglu, L. (2022). Sparx: Distributed Outlier Detection at Scale. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 4530–4540). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539076
