|
|
|
|
|
by joppy
2307 days ago
|
|
A bloom filter is different here. Imagine we're working with a fixed 512MB of RAM, which we're going to use as a bitset addressing a range of [0 ... 2^32). An over-estimate of the 22.8GB hash file at 40 characters per hash gives that it it contains about 2^29 hashes, so if we store the 32-bit prefix of each hash in this table, we expect it to be about (2^29 / 2^32) = 1/8th full. We can use this set of prefixes to filter the data. Given an input hash, check whether its prefix is in the table. If it's not in the table, the prefix is certainly not in the file. If the prefix is in the table, there is a 1/8 ~ 13% chance of a "false positive" where the query is not actually in the file, but the set thinks it is. What a Bloom filter does is start using the rest of the hash, by storing the presence of the next 32-bits of the hash in the same set, and so on. By storing the next 32 bits as well, we can drop the false positive rate to 6%. By using the next chunk of 32 bits, it drops again to 4%, and finally when storing 5 32-bit chunks (very conveniently, we have at this point "used up" all 32-bit chunks of our original 160 bits) to around 2.5%. All the while, the bloom filter is fitting into the same 512MB we originally gave it. |
|