Hacker News new | ask | show | jobs
by haberman 2306 days ago
There are 555,278,657 passwords in the database. With a Bloom Filter, you could quickly rule out potential inputs. Even better, there is no need to hash the input, because... it's already a cryptographic hash.

The input SHA-1 provides 160 bits of hash. If we divide that up into 5 hash values of 32 bits, we can get a 3% false positive rate with a 483 MiB Bloom Filter (which will easily fit in memory).

https://hur.st/bloomfilter/?n=555M&p=0.03&m=&k=

This will be blindingly fast. We're talking 5 random reads from memory. Even in the worst case of 5 cache misses, we're still well under 1us. This will let us return "not found" for 97% of inputs that aren't in the database.

If we get a hit there, then we could turn to a larger bloom filter for greater accuracy, but we'd have to actually hash the key to get more hash bits.

Of course if you get hits for all your bloom filters, you still have to do a real lookup to positively confirm that the key is in the database.

1 comments

> Of course if you get hits for all your bloom filters, you still have to do a real lookup to positively confirm that the key is in the database.

As I replied elsewhere, please do not ignore this part for good user experience. I've seen it ignored in open source projects (Keycloak). You don't wanna block perfectly fine passwords for reasons unknown to the user because of false positives. That might cause unwanted reactions at your user's side ("was my password leaked!?!?").

The false positive rate can be made arbitrarily small. I have an implementation with a fp rate of 1:1.000.000 with a filter size of just 1.8 GB (https://github.com/adewes/have-i-been-bloomed). The cost of lowering the rate is logarithmic so you could go to much lower values for little cost (1:1.000.000.000 would be 2.8 GB) so false positives are not really a problem in practice. If you really want you could still perform an exact check against a DB if the filter returns true to rule out false positives with certainty, though at one false positive for one billion requests this might be exaggerated.
That makes sense. I was thinking of this more as a fun algorithms optimization challenge, not something to actually put into a production setting with real users. I agree that for real users you would especially not want to skip the last step.