Hacker News new | ask | show | jobs
by sukaka 4625 days ago
seeing at least 1 collision the tests makes me wonder why the chance of a bitcoin address collision is almost nil. https://en.bitcoin.it/wiki/Weaknesses#Generating_tons_of_add...
4 comments

Bitcoin uses a cryptographic hash function, SHA256. There are, at present, no known collisions with SHA256.

This question is about non-cryptographic hash functions, as are commonly used in hash tables[1] which implement associative arrays, or sharding databases based on the hash of the record key.

Non-cryptographic hash functions are simply not in the same league as cryptographic ones; cryptographic hash functions are optimized to make it infeasible for an attacker to generate a different message with the same hash, while a non-cryptographic hash function generally simply has to have uniformly random output over its keyspace.

[1] http://en.wikipedia.org/wiki/Hash_table

All of which is true and well written. However it brings up the obvious point missed in the otherwise well written article that yes, if I use 32 bit hashes to shard into 2 to the power of 32 databases, some will have more collisions than others, aka the 32 MSB of a 32 bit hash has some "bad looking" variations. But I don't care about non-randomness in the LSBs because I'm infinitely more likely to shard into, perhaps, 2 to the power of 4 database machines.

The original article did not descend into that obvious area of research. I see no particular reason why a hash algo that has the worst randomness shoved into the least significant byte (which I simply don't care about) might be an inherent result of smooshing the best randomness into its most significant nibble, which I do care very much about. Given the likely use case for a sharding hash, a smart hash designer would make sure that most of his effort is put into smooth distribution in MSB and perhaps totally ignore the LSB for a given amount of latency / computation / electrical power. After all, the actual users are more likely to hash based on the first 4 bits than the first 24 bits. Although you'll always run into people who think its funny to pull their shard subset out of the hash using the LSB (why?) or some random byte in the middle (why?)

I think MSB for shard key comes out of the tradition in the really olden days of sharding based on raw unhashed data. Sometimes thats random enough such that the MSB of the data makes an excellent shard key and hashing would just slow things down for a minimal gain, even today.

Whoa, it's conventional to use the MSB? I usually just do

  bucket = hash(x) & mask
Should I expect non-cryptographic hashes and RNGs to produce better noise in the MSBs? That would explain "Sun Redefines Randomness"
Well, look on the bright side, at least you're not calculating

bucket = hash(x) % buckets

Unless you're stuck supporting non power of two number of shards in which case you need modulus. And shoving more randomness into the LSB is more important than the MSB.

I've thought about it some and I think anyone with a networking background is automatically going to "subnet" their data from MSB down out of pure raw habit. Of course you split top down, just like IP addrs.

I've also seen non-technical popular explanations of sharding which stereotypically use something like phone numbers and start at the most significant digit.

And a stereotypical test question is this does not work well with stuff like customer IDs or transaction IDs because they tend toward sequential-ish, unless you're talking about some kind of data mining thing not daily live transactions.

On the other hand it works well if you shard (sorta) by date if you use the right (usually non-native) format. So if you have dates like 2013-10-14 and shard by YYYYMM somehow, then it could be easy to wipe 201210 from the database because whichever shard its on, its probably not impacting the latency figures for 201310 today. Unless you wanted to speed up the delete by smoothing it across all shards in which case sharding by hash ends up being the smart idea.

Trying to do tricky things can turn into a mess when the tricky thing changes mid project, too.

This is actually an important topic for me. I am implementing a sharding distribution based on consistent hashing using MurmurHash3 as hash function.

I am taking the first 4-bytes of the hash function output and using that. I checked and MurmurHash3 mixes the first and last 8 bytes of hash output as a last step, but I am not sure how much differentiation there is in the first 4-bytes.

I guess it is something I should check.

These are 32-bit hashes. 160-bit hashes with similar characteristics are outrageously less likely to generate collisions. Something like 3.4 X 10^38 less likely if I'm running ColinWright's formula properly.

That difference is computationally extremely significant. It took the SO poster ~9ms to find a collision with, e.g., Murmur. If you mapped those results to the 160-bit hash, finding a collision, even ignoring the added time to compute the larger hash, would take 97 octillion years.

This is not the right way to look at it. Finding a collision on CRC256 (a 256-bit CRC code) is nowhere near as hard as finding a collision on SHA256 (assuming SHA256 is secure). A CRC code is just the remainder modulo a polynomial, so adding any multiple of that polynomial to the string will give you a collision. The probability of a pair of independently and uniformly sampled random strings having the same CRC256 hash is certainly small, but cryptographic hashes make a strong guarantee: even if the pair is not sampled independently or uniformly a collision is unlikely.

It is also worth pointing out that the hash size is not necessarily a measure of security. Very Smooth Hash is a good example of this: VSH has security that depends on the hardness of a problem that is closely related to integer factorization, and produces hashes that are as long as its parameters. You might need 3072 bit parameters for VSH to be secure, and will thus have 3072 bit hashes; but the hardness of finding a collision will be about as hard as brute-forcing a 128 bit keyspace (estimating these things is something of a dark art, and I am not an expert; it might be that VSH requires much larger parameters than RSA for equivalent security).

Of course an application like Bitcoin needs a cryptographic hash. But the parent comment was concerned that a single-digit number of collisions showed up in a couple hundred thousand test cases of a 32-bit hash. A similar count of collisions has a good chance of showing up with a 32-bit cryptographic hash, too. It's the cryptographic properties in concert with the size of the range that make the function secure for its intended use.
Cryptographic hashes are an entire category of hash functions unto themselves. The absolute minimum guarantee that a cryptographic hash makes is that for any polynomial time algorithm, the probability of the algorithm computing a pair (x,y) such that x != y and H(x) = H(y) is "negligible" i.e. is smaller than 1/p(n) for all polynomials p and large enough n; in other words, cryptographic hashes guarantee that collisions are extremely rare. Much like you can create a cipher with security that is reducible to the hardness of some problem (e.g. ElGamal is secure if the Diffie-Hellman problem is hard), it is possible to create collision resistant hashes with such security proofs, e.g. Very Smooth Hash, SWIFFT. In practice, though, we just use hashes that are believed to be security according to heuristic tests; hence the need to periodically update standard hash functions, rather than to just increase parameter sizes.

Also interesting is universal hashing, which has applicability in cryptography (for randomness extraction, among other things):

https://en.wikipedia.org/wiki/Universal_hashing

The length of the digest is 32 bits in the linked example and for Bitcoin is 160 bits. Assuming the hash functions are good, if you only have 32 bits of output you can expect to find a collision after 77,000 hashes; the examples used three times that! With 160 bits, you need to generate many orders of magnitude more hashes before a collision is likely.

See http://en.wikipedia.org/wiki/Birthday_attack