by jbeda 3926 days ago
Using a hash or CRC here is totally necessary. Oftentimes the TCP checksum misses corruption that happens outside the network stack. Having an end-to-end check will catch, say, memory bit flips and such after data comes off the wire.

But there is no call for a cryptographic hash here. This isn't being used as any sort of ID or to verify integrity outside of corruption.
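
To make the end-to-end idea concrete, here is a minimal sketch using CRC-32 from Python's standard `zlib` module. The function names and the `flip_bit` helper are made up for illustration, and CRC-32 is only a stand-in for whatever check a real service would pick.

```python
import zlib


def attach_checksum(payload: bytes) -> tuple[bytes, int]:
    """Sender side: compute a CRC-32 over the bytes before they leave the application."""
    return payload, zlib.crc32(payload)


def verify_checksum(payload: bytes, expected_crc: int) -> bool:
    """Receiver side: recompute the CRC-32 over what actually arrived."""
    return zlib.crc32(payload) == expected_crc


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a single memory bit flip (hypothetical helper for this demo)."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


data, crc = attach_checksum(b"some object contents" * 1000)
assert verify_checksum(data, crc)                           # clean copy passes
assert not verify_checksum(flip_bit(data, 12345), crc)      # one flipped bit is caught
```

The check only helps if it is computed as close to the data source as possible and verified as close to the final destination as possible.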

3 comments

No, it's pretty much totally unnecessary.

The API works on top of TLS, which already includes cryptographic authentication of all data (usually via SHA-1/2 HMAC or AES-GCM).

The hash would be computed at the client right after reading from disk and right before TLS encryption, and since they seem to terminate TLS at the storage server it would be computed right after TLS decryption and right before storage, so it doesn't seem to provide any gain.

I think they should just remove it, or at least make it optional.

When operating at scale, you will, once in a while, have corruption. Even if you use ECC RAM, once in a while you'll have a double bit flip. And it doesn't look like Backblaze uses ECC (https://www.backblaze.com/blog/storage-pod-4-5-tweaking-a-pr...) despite good evidence that ECC is necessary (PDF: http://static.googleusercontent.com/media/research.google.co...). Even if you do have ECC, you'll once in a while have a bad NIC with HW offload that will corrupt the TCP stream silently.

This is all rare, but it does happen. This is why the GCS team wants to know if you are seeing corruption on file upload as it might be some bad hardware failing in a non-obvious way.

I just spent 10 or so minutes looking and it looks like they do use ECC and, per https://news.ycombinator.com/item?id=2786695, see ECC corrections reported in their log files.

As jbeda mentions, hardware errors are one big reason: with the scale S3/Azure/GCS/Backblaze operate at, it's a matter of when and not if you're going to run into problems. Also: TLS may guarantee the bits your client sends are the ones their server receives, but that's just one cause of errors.

There's the write path from when B2 receives your bits to when they're stored on disk, for one. You could have unforeseen bugs in the code sitting on the other end of their upload URL (it's probably not all theirs, and even if it were, it was written by human developers).

Or B2's internal network path (if they have any) between that and the disk. Ideally that would provide integrity too, but maybe not. They offer a low price point and call out other compromises they make to achieve it (e.g. limited load balancing) - so while I really doubt it, it's remotely plausible they deem the internal overhead of SSL too high.

But then there's the potential for mismatch between "what the customer thinks they uploaded" and "what the customer actually uploaded" too! Less of an issue for now because their API only appears to support uploading files all at once, but eventually I'm sure they'll support a multipart upload scheme like the other platforms do. At which point uploads become more complicated since clients need to retain state and potentially resume. What if a client screws it up and there's some off-by-one error (or whatever)? If you can provide instant feedback, at upload time, that your clients provided bogus data, that's a good thing.
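
A rough sketch of the "client screws up a resume" case, assuming the client declares a SHA-1 of the whole file up front and the server recomputes it over what it received. All function names are hypothetical and this is not B2's actual API, just the shape of the check.

```python
import hashlib


def client_declared_sha1(file_bytes: bytes) -> str:
    """What the client believes it is uploading."""
    return hashlib.sha1(file_bytes).hexdigest()


def buggy_resume(file_bytes: bytes, part_size: int, resume_from_part: int) -> bytes:
    """Hypothetical client bug: after a resume, sending restarts one byte too late."""
    already_sent = file_bytes[: resume_from_part * part_size]
    rest = file_bytes[resume_from_part * part_size + 1 :]  # off-by-one: skips a byte
    return already_sent + rest


def server_accepts(received: bytes, declared_sha1: str) -> bool:
    """Server side: reject the upload immediately if the bytes don't match the claim."""
    return hashlib.sha1(received).hexdigest() == declared_sha1


original = bytes(range(256)) * 100
declared = client_declared_sha1(original)

assert server_accepts(original, declared)
assert not server_accepts(buggy_resume(original, 1000, 3), declared)
```

The failure shows up at upload time, while the client still has the source file, instead of months later on download.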

You can argue it's a painful requirement to force on users since it means they have to track/compute it themselves (might be nontrivial for streaming applications), which is fair. But there are enough points of failure, and the numbers so large, that errors happening is a fact and you really need to insure against it. Especially here, your entire reason for existing is to reliably store bits so it's kinda important to get it provably right.

It seems completely sensible to err on the side of caution, especially as a new and relatively unproven platform (as an object storage platform provider I mean, obviously they have tons of experience storing things).

There are many places in your stack where data corruption can and will occur. You are correct that TLS provides payload integrity on a per-record basis - but it doesn't protect you against silent truncation (to fight this, always declare and check content-length, or use chunked encoding). I have seen corruption occur in NIC buffers, ECC'd main memory, Xen MMU'd memory pages (yes, Xen was responsible), and multiple places in HTTP server and client stacks. None of those failures manifested until hundreds of terabytes of data had successfully gone through the system.
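
The content-length advice can be sketched like this, assuming the receiver knows the declared length before it starts reading. `read_exactly` and `TruncatedBody` are names invented for this example; they are not from any HTTP library.

```python
import io


class TruncatedBody(Exception):
    """Raised when the stream ends before the declared length is reached."""


def read_exactly(stream, declared_length: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read declared_length bytes from stream, failing loudly on early EOF."""
    chunks = []
    remaining = declared_length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # EOF before we saw everything the sender promised
            got = declared_length - remaining
            raise TruncatedBody(f"expected {declared_length} bytes, got {got}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


assert read_exactly(io.BytesIO(b"hello"), 5) == b"hello"
try:
    read_exactly(io.BytesIO(b"hel"), 5)  # body cut short
except TruncatedBody as exc:
    print("rejected:", exc)
```

A length check catches truncation but not flipped bits, so it complements a checksum rather than replacing it.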

If you're handling data on behalf of others, it's paramount that you checksum data end-to-end. Amazon S3 allows you to do this by sending the MD5 or SHA along with the data. Google GCS allows you to do this with CRCs (which, despite what others in this thread say, are more appropriate for the task than crypto hashes, as long as you use enough bits).
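
For the MD5 case, the usual HTTP convention is a `Content-MD5` header carrying the base64 of the raw 16-byte digest (not the hex string). A small sketch, with `content_md5` as an illustrative helper name and no actual request being made:

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Base64 of the binary MD5 digest, as used in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


body = b"object bytes go here"
headers = {
    "Content-Length": str(len(body)),   # lets the server detect truncation
    "Content-MD5": content_md5(body),   # lets the server detect corruption
}
print(headers)
```

Check the storage provider's documentation for the exact header names and hash algorithms it accepts; the dictionary above is only an assumption about the general shape.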

A cryptographic hash is pretty much as fast as anything else, and lets you be certain. There's no good reason to use anything else.

Apparently, SHA-1 is pretty slow compared to others, about 20x slower than the fastest hash algorithms out there.

https://github.com/Cyan4973/xxHash

You would think that, if it's just being used as a checksum, anything that passes https://code.google.com/p/smhasher/wiki/SMHasher with high marks would be sufficient.
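
If you want to measure the speed gap on your own hardware rather than take a number on faith, something like the following works with only the standard library. xxHash is not in the stdlib, so this sketch compares SHA-1, MD5, and CRC-32 only; the function names are made up and results will vary a lot by CPU and library build.

```python
import hashlib
import time
import zlib


def throughput_mb_per_s(fn, data: bytes, repeats: int = 5) -> float:
    """Best-of-N throughput of fn over data, in MB/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return len(data) / max(best, 1e-9) / 1e6  # guard against a zero timer reading


def run_benchmark(size: int = 16 * 1024 * 1024) -> dict[str, float]:
    """Hash a zero-filled buffer of the given size with each candidate."""
    data = bytes(size)
    candidates = {
        "sha1": lambda d: hashlib.sha1(d).digest(),
        "md5": lambda d: hashlib.md5(d).digest(),
        "crc32": zlib.crc32,
    }
    return {name: throughput_mb_per_s(fn, data) for name, fn in candidates.items()}


if __name__ == "__main__":
    for name, rate in run_benchmark().items():
        print(f"{name:>6}: {rate:8.1f} MB/s")
```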

Why would you want 'just' a checksum? I want something I can rely on. If I have to dedicate half a core per Gbps of internet-crossing upload, that's not a big deal.

The purpose here is not to secure your data against an attacker (that's what TLS is for), or even against errors in transmission (as others have noted, TLS has you covered there as well) - you need something simple and inexpensive to secure against errors in hardware/memory before/after it enters that pipeline. While you shouldn't under-solve a problem, there are real costs to over-solving the problem as well.

You don't need a real attacker to want safety from assumptions that will be true the vast majority of the time, such as "same hash = same file".

For example, I might have md5-colliding files on my hard drive somewhere, that someone else made as a proof of concept. I honestly don't know. But I would worry about using a storage system that depends on md5, because what if it deduplicates without checking every byte?
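
Here is roughly what "checking every byte" looks like in a deduplicating store. `DedupStore` is a toy invented for this comment, not how any real service works; the hash function is injectable so a collision can be forced in a test.

```python
import hashlib
from typing import Callable


class DedupStore:
    """Toy content store: the hash picks a bucket, a byte compare decides equality."""

    def __init__(self, hash_fn: Callable[[bytes], str] = lambda b: hashlib.md5(b).hexdigest()):
        self._hash_fn = hash_fn
        self._buckets: dict[str, list[bytes]] = {}

    def put(self, blob: bytes) -> tuple[str, int]:
        digest = self._hash_fn(blob)
        bucket = self._buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == blob:  # full comparison, never trust the hash alone
                return digest, slot
        bucket.append(blob)  # same digest, different bytes: keep both
        return digest, len(bucket) - 1

    def get(self, key: tuple[str, int]) -> bytes:
        digest, slot = key
        return self._buckets[digest][slot]


# Force every blob into the same bucket to imitate colliding files.
store = DedupStore(hash_fn=lambda b: "collide")
k1, k2 = store.put(b"file one"), store.put(b"file two")
assert k1 != k2 and store.get(k2) == b"file two"
```

A store that skips the byte comparison would hand back "file one" for both keys.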

For the same reason that UTF-16 has encouraged so many broken implementations, at least in a pre-emoji world, it's a bad idea to almost but not quite support convenient features. Either clearly don't support something, or fully support it.

The checksum in TCP is not powerful enough, but CRCs can be made arbitrarily powerful. The main advantage of CRCs is that they can be independently computed for multiple parts and combined when concatenating the parts.
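
The combining property can be shown with CRC-32 from Python's `zlib`. As far as I know the Python module does not expose the C library's `crc32_combine`, so this sketch does it the slow way by running the first CRC through as many zero bytes as the second part is long. That is O(len_b) rather than the logarithmic matrix trick, but the result is the same.

```python
import zlib

MASK = 0xFFFFFFFF


def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """CRC-32 of A+B, given only crc32(A), crc32(B), and len(B).

    CRC updates are linear over GF(2), so crc32(A+B) equals crc32(B) XOR
    "crc32(A)'s register advanced across len(B) zero bytes". The XORs with
    MASK undo zlib's initial and final inversion around that advance.
    """
    advanced = zlib.crc32(bytes(len_b), crc_a ^ MASK) ^ MASK
    return advanced ^ crc_b


part_a = b"first part of the upload, "
part_b = b"second part of the upload" * 50

combined = crc32_combine(zlib.crc32(part_a), zlib.crc32(part_b), len(part_b))
assert combined == zlib.crc32(part_a + part_b)
```

This is what makes CRCs convenient for multipart uploads: each part can be checksummed on a different machine and the whole-object value derived without rereading the data.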