Hacker News new | ask | show | jobs
by judofyr 761 days ago
Very interesting and I'll have to read more to understand how it fully works, but _initially_ the space requirements doesn't seem too impressive? Am I missing something here? Is my calculation/assumption completely off? Maybe the solution here is more flexible?

One alternative approach for many of these problems is to start with a perfect minimal hash function which hashes your key into a unique number [0, N) and then have a packed array of size N where each element is of a fixed byte size. To look up the value you first use the hash function to get an index, and then you look up in the packed array. There's also no error rate here: This is exact.

PTHash (https://arxiv.org/abs/2104.10402) needs roughly ~4 bits per element to create a perfect minimal hash function.

> Store 1 billion web URLs and assign each of them one of a small number of categories values (n=8) in 2.22Gb (params include ν=8, κ=1, =0.1%; 19 bits per element)

Assuming that "n=8" here means "8 bits" we need 1GB (8 bits * billion) to represent all of the values, and then 500 MB for the hash function (4 bits * billion).

I also don't quite understand what "2.22Gb" here refers to. 19 bits * billion = 2.357 SI-giga bytes = 19 SI-giga bits = 2.212 gibi bytes.

> Store 1 billion DNA or RNA k-mers ("ACGTA...") and associate them with any of the ~500k bacterial IDs current described by NCBI in 6.93Gb (ν=62, κ=4, =0.1%; 59 bits per element)

"~500k bacterial ID" can be represented with 19 bits. 1 billion of these take ~2.3GB, and then we have the additional 500MB for the perfect hash function.

Another data structure which is even more fine-tuned for this problem space is Bumped Ribbon Retrieval (https://arxiv.org/abs/2109.01892) where they have <1% overhead over just storing the plain bit values.

EDIT: Aha! One thing I forgot about: The alternatives I mentioned above all have a construction cost. I've been playing with them in the 100k-1M range and they've all been pretty instant (<1s), but I don't have any experience in the billion range. Maybe it's too slow there?

2 comments

PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).

What I'm wondering is why the Kraken2 probabilistic hash table doesn't work. It uses 32 bits per element in an open addressing hash table. For 1 billion k-mers and 19 bits for the value, 32 - 19 = 13 bits of the key hash can be stored alongside the value, helping disambiguate hash collisions. If the load factor is 1.25x, then that's 4 * 10^9 * 1.25 = 5GB total, better than ~7GB. Also, this is only one cache miss (+ linear probing that can be SIMD accelerated) per lookup.

> PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).

Yes, exactly.

> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.

I just skimmed the paper again (has been a while since a close reading), but my refreshed understanding is:

* Like the B-field, there are also false positives.

* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a "reduced" value for those key-value pairs. While there's an elegant solution for this issue for the problem of taxonomic classification (lowest common ancestor), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but this rate can be reduced to ~0 with secondary arrays.

* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know how many tax IDs current Kraken2 DB builds use offhand, but the error rate climbs significantly as you use additional bits for the value vs. key (e.g., to represent >=2^20 values, see Fig S4). I don't have a good sense for the performance and other engineering tradeoffs of just extending the hash code >32 bits. I also don't know what the data structure overhead is beyond those >32 bits/pair.

So, for a metagenomics classifier specifically, some subtle tradeoffs but honestly database quality and the classification algorithm likely matters a lot more than the marginal FP rates with either data structure -- we just happen to have come to this solution.

For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding) but of course it depends on the specifics.

One huge downside of your suggested approach is that adding a single entry requires rebuilding the entire hash.