| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nulptr 2077 days ago

I have a question about how you generated collisions.

If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.

With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.

But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?

3 comments

Thorrez 2077 days ago

Since the hash is 60 bits, a collision is expected after 2^30 tries due to the birthday paradox. So by your math this would be 12.3GB. Although this is ignoring hash table overhead, which I expect to be at least 2x. Maybe they just got lucky.

link

nulptr 2077 days ago

I think I found a way to do it with ~8GB. As you mentioned you only need 2^30 tries due to birthday paradox.

But, instead of using a hash table to store (seed, hash) pairs, we can just use a list of only hashes. To generate the next hash we use current hash. i.e. next_hash = hash("example.com/" + string(int64(cur_hash))). So now, we just store 60-bit hash values which takes up 2^30 * 60 bits = 8gigs.

The only other problem is detecting if a hash already exists, which I think can be done using a space efficient probabilistic hashset like a bloom filter (and when the bloom filter says the hash already exists, we do a linear scan to find where it exists and we have a possible collision).

link

penteract 2077 days ago

Actually, using that method, you'd need much less than 8GB, since a collision would mean you'd hit a cycle, so you could store something like 1 value for every 2^10 iterations. When you found a collision with one of those, you could examine the interval more carefully to find exactly where the collision was.

link

nulptr 2077 days ago

Ahh yes, that's very clever. Thanks for sharing :)

link

jasonwatkinspdx 2077 days ago

This is essentially the idea behind rainbow tables, which let you make a configurable time/space tradeoff when trying to find a hash collision vs a fixed known dictionary/space.

link

tialaramex 2076 days ago

Hellman (yes as in Diffie-Hellman) invented this idea (chaining a reduction function and a hash together over and over) for the improved time/space tradeoff. Hellman's idea actually predates the terrible Microsoft cryptosystems which are famously vulnerable to it, because too many software engineers don't realise Computer Science and thus Mathematics are applicable to their work whether they like it or not.

Oechslin improved on Hellman's idea to make "Rainbow Tables" by sightly perturbing the reduction function with the depth, so that collisions are fleeting unless they also coincidentally happen in the same position on a hash chain, this makes his Rainbow Table far more efficient/ reliable for the same pre-computation effort and it produces the "rainbow" picture in his slide deck.

For my purpose Oechslin's improvement was irrelevant of course because I actually wanted collisions.

But memory fades, it is possible either that I got lucky, that I actually had 16GB of RAM, or that I made some minor efficiency improvements that were enough to make it work.

link

Thorrez 2077 days ago

Interesting, that's similar to rainbow tables.

link

manulp 2077 days ago

You’re looking for a collision, any of them will do, so your computation doesn’t actually need to go through all the elements. https://en.m.wikipedia.org/wiki/Birthday_attack

link

bluesign 2077 days ago

You can also truncate hash at 30 bits and use it to address bucket, 12gb then would give you like 3x32 bits, which you can store your candidates. Then compare full hash with candidates.

link