| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by thedance 2307 days ago
	You're right. A bloom filter is a random projection of the input domain. A random projection of a uniformly-populated random input is not useful. A large set of password hashes should be already maximally entropic.

3 comments

joppy 2306 days ago

A bloom filter is different here. Imagine we're working with a fixed 512MB of RAM, which we're going to use as a bitset addressing a range of [0 ... 2^32). An over-estimate of the 22.8GB hash file at 40 characters per hash gives that it it contains about 2^29 hashes, so if we store the 32-bit prefix of each hash in this table, we expect it to be about (2^29 / 2^32) = 1/8th full.

We can use this set of prefixes to filter the data. Given an input hash, check whether its prefix is in the table. If it's not in the table, the prefix is certainly not in the file. If the prefix is in the table, there is a 1/8 ~ 13% chance of a "false positive" where the query is not actually in the file, but the set thinks it is.

What a Bloom filter does is start using the rest of the hash, by storing the presence of the next 32-bits of the hash in the same set, and so on. By storing the next 32 bits as well, we can drop the false positive rate to 6%. By using the next chunk of 32 bits, it drops again to 4%, and finally when storing 5 32-bit chunks (very conveniently, we have at this point "used up" all 32-bit chunks of our original 160 bits) to around 2.5%. All the while, the bloom filter is fitting into the same 512MB we originally gave it.

link

thedance 2306 days ago

Thanks. I assumed it was more densely populated than that, but now that I think about it it's obviously a human-scale set of passwords.

link

marcan_42 2306 days ago

No, a bloom filter requires k random projections of the input domain over a domain of m elements. In my smaller sample configuration, k=11 and m~=5000000000. log2(5000000000^11) ~= 354 bits (and you need a few more for padding to get better uniformity). SHA-1 hashes are 160 bits, so you can't use the original hash directly to index into the bloom bitmap. You need to hash again at least twice (or once with something like SHA-512).

But really, hashing is so cheap there's no point in trying to be clever like this for an on-disk implementation. My code is designed to take any strings as input, whether they are HIBP hex hashes already or something else. If this were intended to be a high performance in-memory implementation used for a database or something, the constraints would be different.

link

bscphil 2307 days ago

Thanks! That's what I was thinking.

link