Sparx: Distributed Outlier Detection at Scale

Abstract

There is no shortage of outlier detection (OD) algorithms in the literature, yet the vast majority are designed for a single machine. As more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billion points and millions of features, we show that existing open-source solutions fail to scale up, either to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.

Cite

Zhang, S., Ursekar, V., & Akoglu, L. (2022). Sparx: Distributed Outlier Detection at Scale. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 4530–4540). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539076
