HyperMinHash: MinHash in LogLog Space

Yun William Yu; Griffin M. Weber

Journal Article (Open Access)

IEEE Transactions on Knowledge and Data Engineering (2022) 34(1) 328-339

DOI: 10.1109/TKDE.2020.2981311

Abstract

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash because it builds on a HyperLogLog scaffold, can be used as a drop-in replacement for MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ε on a Jaccard index t, given a random oracle, HyperMinHash needs O(ε^{-2}(log log n + log(1/ε))) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 10 percent using 2 MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption.
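The floating-point idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`encode`, `sketch`, `union`, `jaccard`), the use of SHA-256 as a stand-in for a random oracle, the parameter values `k`, `q`, `r`, and the simple match-counting estimator (which omits any correction for accidental register collisions) are all assumptions made for illustration. Each bucket keeps only a `q`-bit exponent (the number of leading zeros of the minimum hash, as in HyperLogLog) and an `r`-bit mantissa (the bits that follow), instead of the full hash.

```python
import hashlib

HASH_BITS = 64  # width of the hash value that gets compressed (assumed)


def encode(h, q=6, r=10):
    """Lossy float-like encoding of a 64-bit hash as (exponent, mantissa)."""
    max_exp = (1 << q) - 1
    lz = HASH_BITS - h.bit_length()      # leading zeros; 64 when h == 0
    exp = min(lz, max_exp)
    if lz < max_exp:
        tail_bits = HASH_BITS - lz - 1   # bits after the implicit leading 1
    else:
        tail_bits = HASH_BITS - max_exp  # exponent saturated: no implicit 1
    tail = h & ((1 << tail_bits) - 1)
    shift = tail_bits - r
    # Keep the top r bits of the tail, zero-padding if fewer are available.
    mant = tail >> shift if shift >= 0 else tail << -shift
    return exp, mant


def sketch(items, k=1024, q=6, r=10):
    """One register per bucket, holding the encoded minimum hash."""
    regs = [None] * k
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % k
        h = int.from_bytes(digest[4:12], "big")
        exp, mant = encode(h, q, r)
        key = (-exp, mant)               # smaller key <=> smaller hash
        if regs[bucket] is None or key < regs[bucket]:
            regs[bucket] = key
    return regs


def union(a, b):
    """Sketch of the set union: per-bucket minimum of the two sketches."""
    out = []
    for x, y in zip(a, b):
        if x is None:
            out.append(y)
        elif y is None:
            out.append(x)
        else:
            out.append(min(x, y))
    return out


def jaccard(a, b):
    """Fraction of non-empty union buckets whose registers agree."""
    filled = sum(1 for x in union(a, b) if x is not None)
    matches = sum(1 for x, y in zip(a, b) if x is not None and x == y)
    return matches / filled if filled else 0.0


A = range(0, 6000)
B = range(3000, 9000)        # true Jaccard index is 3000 / 9000 = 1/3
estimate = jaccard(sketch(A), sketch(B))
```

With these assumed parameters each register needs `q + r = 16` bits rather than 64, while streaming updates and unions still work because the encoding preserves the ordering of hash values up to truncation.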

Cite

Yu, Y. W., & Weber, G. M. (2022). HyperMinHash: MinHash in LogLog Space. IEEE Transactions on Knowledge and Data Engineering, 34(1), 328–339. https://doi.org/10.1109/TKDE.2020.2981311
