Hacker News new | ask | show | jobs
by bscphil 2307 days ago
> Personally, when I implemented this in a web service, I used a bloom filter.

I'm not a professional programmer, but I would have supposed a bloom filter is in this case equivalent to just saving the first n bits of every sha1 hash. Is that wrong?

2 comments

You're right. A bloom filter is a random projection of the input domain. A random projection of a uniformly-populated random input is not useful. A large set of password hashes should be already maximally entropic.
A bloom filter is different here. Imagine we're working with a fixed 512MB of RAM, which we're going to use as a bitset addressing a range of [0 ... 2^32). An over-estimate of the 22.8GB hash file at 40 characters per hash gives that it it contains about 2^29 hashes, so if we store the 32-bit prefix of each hash in this table, we expect it to be about (2^29 / 2^32) = 1/8th full.

We can use this set of prefixes to filter the data. Given an input hash, check whether its prefix is in the table. If it's not in the table, the prefix is certainly not in the file. If the prefix is in the table, there is a 1/8 ~ 13% chance of a "false positive" where the query is not actually in the file, but the set thinks it is.

What a Bloom filter does is start using the rest of the hash, by storing the presence of the next 32-bits of the hash in the same set, and so on. By storing the next 32 bits as well, we can drop the false positive rate to 6%. By using the next chunk of 32 bits, it drops again to 4%, and finally when storing 5 32-bit chunks (very conveniently, we have at this point "used up" all 32-bit chunks of our original 160 bits) to around 2.5%. All the while, the bloom filter is fitting into the same 512MB we originally gave it.

Thanks. I assumed it was more densely populated than that, but now that I think about it it's obviously a human-scale set of passwords.
No, a bloom filter requires k random projections of the input domain over a domain of m elements. In my smaller sample configuration, k=11 and m~=5000000000. log2(5000000000^11) ~= 354 bits (and you need a few more for padding to get better uniformity). SHA-1 hashes are 160 bits, so you can't use the original hash directly to index into the bloom bitmap. You need to hash again at least twice (or once with something like SHA-512).

But really, hashing is so cheap there's no point in trying to be clever like this for an on-disk implementation. My code is designed to take any strings as input, whether they are HIBP hex hashes already or something else. If this were intended to be a high performance in-memory implementation used for a database or something, the constraints would be different.

Thanks! That's what I was thinking.
It's not just saving a few bits. A bloom filter requires a lot less storage than just chopping the hashes for the same false positive performance, because it uses multiple hashing steps (in my default config, 11, but this is tunable depending on false positive rate and input size), and all of them have to hit bits that are set in the filter for the match to be positive.

Pwned passwords 2.0 was half a billion passwords. That's 29 bits. To get a 0.05% false positive rate you need an additional 11 bits or so. That's 40 bits per hash, or 20 billion bits total, and then you need to binary search it (29 lookups) or make it into a tree and add bookkeeping structures.

The equivalent performance bloom filter only takes 5 billion bits of storage and 11 single bit checks in a bitmap.