Hacker News new | ask | show | jobs
by nulptr 2077 days ago
I have a question about how you generated collisions.

If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.

With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.

But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?

3 comments

Since the hash is 60 bits, a collision is expected after 2^30 tries due to the birthday paradox. So by your math this would be 12.3GB. Although this is ignoring hash table overhead, which I expect to be at least 2x. Maybe they just got lucky.
I think I found a way to do it with ~8GB. As you mentioned you only need 2^30 tries due to birthday paradox.

But, instead of using a hash table to store (seed, hash) pairs, we can just use a list of only hashes. To generate the next hash we use current hash. i.e. next_hash = hash("example.com/" + string(int64(cur_hash))). So now, we just store 60-bit hash values which takes up 2^30 * 60 bits = 8gigs.

The only other problem is detecting if a hash already exists, which I think can be done using a space efficient probabilistic hashset like a bloom filter (and when the bloom filter says the hash already exists, we do a linear scan to find where it exists and we have a possible collision).

Actually, using that method, you'd need much less than 8GB, since a collision would mean you'd hit a cycle, so you could store something like 1 value for every 2^10 iterations. When you found a collision with one of those, you could examine the interval more carefully to find exactly where the collision was.
Ahh yes, that's very clever. Thanks for sharing :)
This is essentially the idea behind rainbow tables, which let you make a configurable time/space tradeoff when trying to find a hash collision vs a fixed known dictionary/space.
Hellman (yes as in Diffie-Hellman) invented this idea (chaining a reduction function and a hash together over and over) for the improved time/space tradeoff. Hellman's idea actually predates the terrible Microsoft cryptosystems which are famously vulnerable to it, because too many software engineers don't realise Computer Science and thus Mathematics are applicable to their work whether they like it or not.

Oechslin improved on Hellman's idea to make "Rainbow Tables" by sightly perturbing the reduction function with the depth, so that collisions are fleeting unless they also coincidentally happen in the same position on a hash chain, this makes his Rainbow Table far more efficient/ reliable for the same pre-computation effort and it produces the "rainbow" picture in his slide deck.

For my purpose Oechslin's improvement was irrelevant of course because I actually wanted collisions.

But memory fades, it is possible either that I got lucky, that I actually had 16GB of RAM, or that I made some minor efficiency improvements that were enough to make it work.

Interesting, that's similar to rainbow tables.
You’re looking for a collision, any of them will do, so your computation doesn’t actually need to go through all the elements. https://en.m.wikipedia.org/wiki/Birthday_attack
You can also truncate hash at 30 bits and use it to address bucket, 12gb then would give you like 3x32 bits, which you can store your candidates. Then compare full hash with candidates.