HyperMinHash: MinHash in LogLog Space

Abstract

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash because it builds on a HyperLogLog scaffold, can be used as a drop-in replacement for MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ε on a Jaccard index t, given a random oracle, HyperMinHash needs O(ε⁻²(log log n + log(1/ε))) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10¹⁹ with relative error of around 10 percent using 2 MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10¹⁰ with the same memory consumption.
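The floating-point encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The bucket count `m`, the bit widths `q` and `r`, the use of SHA-256 as a stand-in for the random oracle, and all function names are assumptions chosen for illustration. Each bucket keeps only a capped leading-zero count (the exponent, `q` bits) and the next `r` bits of the minimum hash (the mantissa), and the collision correction analyzed in the paper is omitted.

```python
import hashlib
from typing import Iterable, List, Optional, Tuple

HASH_BITS = 64  # width of the per-item hash value (illustrative assumption)
Cell = Optional[Tuple[int, int]]  # (exponent, mantissa), or None if empty


def _hash(item: str) -> Tuple[int, int]:
    """Two independent 64-bit slices of SHA-256: bucket choice and value."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big"), int.from_bytes(digest[8:16], "big")


def compress(value: int, q: int = 5, r: int = 10) -> Tuple[int, int]:
    """Lossy float-style encoding of a hash: capped leading-zero count
    (q bits) plus the r bits that follow the consumed prefix."""
    max_exp = (1 << q) - 1
    leading_zeros = HASH_BITS - value.bit_length()
    if leading_zeros >= max_exp:
        exponent, consumed = max_exp, max_exp  # capped: no leading 1 skipped
    else:
        exponent, consumed = leading_zeros, leading_zeros + 1  # skip leading 1
    remaining = HASH_BITS - consumed
    mantissa = (value >> (remaining - r)) & ((1 << r) - 1)
    return exponent, mantissa


def _key(cell: Tuple[int, int]) -> Tuple[int, int]:
    # A smaller hash has a larger exponent, then a smaller mantissa.
    return (-cell[0], cell[1])


def sketch(items: Iterable[str], m: int = 256, q: int = 5, r: int = 10) -> List[Cell]:
    """Streaming update: keep the compressed minimum hash per bucket."""
    buckets: List[Cell] = [None] * m
    for item in items:
        bucket_hash, value_hash = _hash(item)
        i = bucket_hash % m
        cell = compress(value_hash, q, r)
        if buckets[i] is None or _key(cell) < _key(buckets[i]):
            buckets[i] = cell
    return buckets


def union(a: List[Cell], b: List[Cell]) -> List[Cell]:
    """Sketch of the set union: bucket-wise minimum."""
    out: List[Cell] = []
    for x, y in zip(a, b):
        if x is None or y is None:
            out.append(y if x is None else x)
        else:
            out.append(min(x, y, key=_key))
    return out


def jaccard(a: List[Cell], b: List[Cell]) -> float:
    """Fraction of matching buckets among buckets filled in either sketch
    (no correction for accidental collisions)."""
    filled = sum(1 for x, y in zip(a, b) if x is not None or y is not None)
    same = sum(1 for x, y in zip(a, b) if x is not None and x == y)
    return same / filled if filled else 0.0
```

With these assumed defaults each bucket needs q + r = 15 bits rather than a full 64-bit hash. Because the encoding preserves the ordering of hash values, compressing before taking the minimum gives the same bucket contents as compressing afterwards, which is why streaming updates and unions still work on the compressed sketch.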

Cite

Yu, Y. W., & Weber, G. M. (2022). HyperMinHash: MinHash in LogLog Space. IEEE Transactions on Knowledge and Data Engineering, 34(1), 328–339. https://doi.org/10.1109/TKDE.2020.2981311
