Hacker News new | ask | show | jobs
by fc417fc802 146 days ago
> The 32 bit hash of CRC32 is too low for file checksums.

What makes you say this? I agree that there are better algorithms than CRC32 for this usecase, but if I was implementing something I'd most likely still truncate the hash to somewhere in the same ballpark (likely either 32, 48, or 64 bits).

Note that the purpose of the hash is important. These aren't being used for deduplication where you need a guaranteed unique value between all independently queried pieces of data globally but rather just to detect file corruption. At 32 bits you have only a 1 out of 2^(32-1) chance of a false negative. That should be more than enough. By the time you make it to 64 bits, if you encounter a corrupted file once _every nanosecond_ for the next 500 years or so you would expect to miss only a single event. That is a rather absurd level of reliability in my view.

1 comments

I've seen few arguments that with the amount of data we have today the 2^(32-1) chance can happen, but I can't vouch their calculations were done correctly.

Readme in SMHasher test suite also seems to indicate that 32 bits might be too few for file checksums:

"Hash functions for symbol tables or hash tables typically use 32 bit hashes, for databases, file systems and file checksums typically 64 or 128bit, for crypto now starting with 256 bit."

That's vaguely describing common practices, not what's actually necessary or why. It also doesn't address my note that the purpose of the hash is important. Are "file systems" and "file checksums" referring to globally unique handles, content addressed tables, detection of bitrot, or something else?

For detecting file corruption the amount of data alone isn't the issue. Rather what matters is the rate at which corruption events occur. If I have 20 TiB of data and experience corruption at a rate of only 1 event per TiB per year (for simplicity assume each event occurs in a separate file) that's only 20 events per year. I don't know about you but I'm not worried about the false negative rate on that at 32 bits. And from personal experience that hypothetical is a gross overestimation of real world corruption rates.

It depends on how you calculate statistics. If you are designing a file format that over the lifetime of the format hundreds of millions of user will use (storing billions of files), what are the chances that 32 bits checksum won't be able to catch at least one corruption? During transfer over unstable wireless internet connection, storage on cheap flash drive, poor HDD with a higher error rate, unstable RAM etc. We want to avoid data corruption if we can even in less then ideal conditions. Cost of going from 32 bit to 64 bit hashes is very small.
No, it doesn't "depend on how you calculate statistics". Or rather you are not asking the right question. We do not care if a different person suffers a false negative. The question is if you, personally, are likely to suffer a false negative. In other words, will any given real world deployment of the solution be expected to suffer from an unacceptably high rate of false negatives?

Answering that requires figuring out two things. The sort of real world deployment you're designing for and what the acceptable false negative rate is. For an extremely conservative lower bound suppose 1 error per TiB per year and suppose 1000 TiB of storage. That gives a 99.99998% success rate for any given year. That translates to expecting 1 false negative every 4 million years.

I don't know about you but I certainly don't have anywhere near a petabyte of data, I don't suffer corruption at anywhere near a rate of 1 event per TiB per year, and I'm not in the business of archiving digital data on a geological timeframe.

32 bits is more than fit for purpose.

I can't say I agree with your logic here. We are not talking about any specific backup or anything like that. We are talking about the design of a file format that is going to be used globally.

Business running a lottery has to calculate the odds of anyone winning, not just the odds of a single person winning. Same, a designer of a file format has to consider chances for all users. What percent of users will be affected by any design decision.

For example, what if you would offer a guarantee that 32 bit hash will protect you from corruption, and compensate generously anyone who would get this type of corruption; how would you calculate probability then?

If you offer compensation then of course you need to consider your risk exposure, ie total users. That's similar to a lottery where the central authority is concerned with all payouts while an individual is only concerned with their own payout.

Outside of brand reputation issues that is not how real world products are designed. You design a tool for the specific task it will be used for. You don't run your statistics in aggregate based on the expected number of customers.

Users are independent from one another. If the population doubles my filesystem doesn't suddenly become less reliable. If more people purchase the same laptop that I have the chance of mine failing doesn't suddenly go up. If more people deep fry things in their kitchen my own personal risk of a kitchen fire isn't increased regardless of how busy the fire department might become.