Randomized algorithms accelerated over CPU-GPU for ultra-high dimensional similarity search

Yiqiu Wang; Anshumali Shrivastava; Jonathan Wang; Junghee Ryu

Conference Proceedings, Open Access

Proceedings of the ACM SIGMOD International Conference on Management of Data (2018) 889-903

DOI: 10.1145/3183713.3196925

16 Citations

56 Readers

Abstract

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n²D) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
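The three ingredients the abstract names (LSH-style indexing with minwise hashing, reservoir sampling, and count-based estimation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It uses classic minwise hashing rather than the one-pass variant the abstract mentions, and the class name `ReservoirLSH`, its parameters, and the toy data are all hypothetical. Each item is assumed to be a non-empty set of integer feature indices.

```python
import random
from collections import Counter

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, modulus for the hash family


class ReservoirLSH:
    """Illustrative only: minhash LSH tables with fixed-size reservoir buckets."""

    def __init__(self, num_tables=16, hashes_per_table=2, reservoir_size=32, seed=0):
        self.rng = random.Random(seed)
        self.reservoir_size = reservoir_size
        # One list of (a, b) hash parameters per table; a != 0 keeps x -> a*x+b bijective.
        self.params = [
            [(self.rng.randrange(1, PRIME), self.rng.randrange(PRIME))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        # Each table maps a bucket key to [items_seen, reservoir_list].
        self.tables = [dict() for _ in range(num_tables)]

    def _keys(self, features):
        # Bucket key per table: a tuple of minwise hashes of the feature set.
        return [
            tuple(min((a * x + b) % PRIME for x in features) for a, b in table_params)
            for table_params in self.params
        ]

    def insert(self, item_id, features):
        for table, key in zip(self.tables, self._keys(features)):
            bucket = table.setdefault(key, [0, []])
            bucket[0] += 1
            seen, reservoir = bucket
            if len(reservoir) < self.reservoir_size:
                reservoir.append(item_id)
            else:
                # Reservoir sampling (Algorithm R): keep the new item
                # with probability reservoir_size / seen.
                j = self.rng.randrange(seen)
                if j < self.reservoir_size:
                    reservoir[j] = item_id

    def query(self, features, k=5):
        # Count-based ranking: how many tables put each candidate in the
        # query's bucket. No pairwise similarity is ever computed.
        counts = Counter()
        for table, key in zip(self.tables, self._keys(features)):
            counts.update(table.get(key, (0, []))[1])
        return [item for item, _ in counts.most_common(k)]


index = ReservoirLSH()
docs = {"a": {1, 2, 3, 4, 5}, "b": {1, 2, 3, 4, 6}, "c": {100, 200, 300}}
for name, feats in docs.items():
    index.insert(name, feats)
result = index.query({1, 2, 3, 4, 5}, k=2)
```

In this toy run, item "a" has the same feature set as the query, so it lands in the query's bucket in every table and ranks first. Item "c" shares no features with the query and therefore never collides with it. Capping each bucket at a fixed reservoir size keeps memory bounded and predictable however skewed the buckets become.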

Cite

Wang, Y., Shrivastava, A., Wang, J., & Ryu, J. (2018). Randomized algorithms accelerated over CPU-GPU for ultra-high dimensional similarity search. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 889–903). Association for Computing Machinery. https://doi.org/10.1145/3183713.3196925
