|
|
|
|
|
by Thorrez
2078 days ago
|
|
Since the hash is 60 bits, a collision is expected after 2^30 tries due to the birthday paradox. So by your math this would be 12.3GB. Although this is ignoring hash table overhead, which I expect to be at least 2x. Maybe they just got lucky. |
|
But, instead of using a hash table to store (seed, hash) pairs, we can just use a list of only hashes. To generate the next hash we use current hash. i.e. next_hash = hash("example.com/" + string(int64(cur_hash))). So now, we just store 60-bit hash values which takes up 2^30 * 60 bits = 8gigs.
The only other problem is detecting if a hash already exists, which I think can be done using a space efficient probabilistic hashset like a bloom filter (and when the bloom filter says the hash already exists, we do a linear scan to find where it exists and we have a possible collision).