by mtdewcmu 3396 days ago
You could probably omit the length, because the length is probably already known implicitly. E.g. git knows the length of its hashes without having to read the hash. If the length is less than that of the full hash, it can be assumed to be truncated.
3 comments

Some hashing algorithms have a configurable output length. You need to encode the length, or have it as part of the type, and it's more uniform to just have it separate from the type.
I guess it depends on how precious space is. A naked hash is very information-dense. In certain applications, inserting a prefix of several bytes to each hash makes a difference. OTOH, if the hashes end up being inserted into a table in MySQL, then space is probably not that precious.
Attackers can exploit that by cutting off a stream. It's best to be explicit.
Hashes don't currently know their own lengths, so I don't see why they'd need to.
Putting this kind of metadata inline with your hashes is the entire point of multihash:

> Multihash is a protocol for differentiating outputs from various well-established hash functions, addressing size + encoding considerations. It is useful to write applications that future-proof their use of hashes, and allow multiple hash functions to coexist.[0]

[0]: http://multiformats.io/multihash/

In the case of multihash, the length field allows for two things: 1. truncation; 2. you don't have to know the hashing function to transparently pass a hash through buffers and/or compare it.
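To make the second point concrete, here is a minimal sketch of the multihash layout (varint function code, varint digest length, then the digest bytes). The helper names `encode_varint`, `decode_varint`, `encode_multihash` and `decode_multihash` are made up for illustration and are not the API of any multihash library; the codes 0x12 (sha2-256) and 0x13 (sha2-512) are taken from the multihash table.

```python
import hashlib
from typing import Optional, Tuple

SHA2_256 = 0x12  # multihash table code for sha2-256
SHA2_512 = 0x13  # multihash table code for sha2-512


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128-style varint: 7 bits per byte, high bit means 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_multihash(code: int, digest: bytes, length: Optional[int] = None) -> bytes:
    """Prefix a digest with its function code and (possibly truncated) length."""
    if length is None:
        length = len(digest)
    if length > len(digest):
        raise ValueError("cannot declare more bytes than the digest has")
    return encode_varint(code) + encode_varint(length) + digest[:length]


def decode_multihash(buf: bytes, pos: int = 0) -> Tuple[int, bytes, int]:
    """Return (code, digest, end position) without needing to know the function."""
    code, pos = decode_varint(buf, pos)
    length, pos = decode_varint(buf, pos)
    digest = buf[pos:pos + length]
    if len(digest) != length:
        raise ValueError("digest shorter than declared length")
    return code, digest, pos + length


if __name__ == "__main__":
    mh = encode_multihash(SHA2_256, hashlib.sha256(b"hello").digest())
    code, digest, end = decode_multihash(mh)
    print(hex(code), len(digest), end)
```

Note that `decode_multihash` never looks the code up anywhere: the length prefix alone is enough to find where the hash ends, which is what lets a reader skip over or compare hashes from functions it doesn't implement.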
Danger: truncation shouldn't be done by just chopping off the end of a hash; the truncated variants should be different hash functions with entirely different images.

Compare, for instance, multihash's treatment of SHA512 truncated to 256 bits with the standardised hash function SHA512/256. The paper that introduced SHA512/256 even gives a generic way to safely truncate SHA512: https://eprint.iacr.org/2010/548.pdf section 5.

Multihash's design begs implementations to validate these things by allowing the sender to arbitrarily truncate them. That's bad.
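A quick way to see the "different images" point, as a sketch: SHA-384 is SHA-512 run with different initial values and cut to 384 bits, the same pattern SHA512/256 follows. SHA-384 is used here as a stand-in only because `hashlib` guarantees it, whereas SHA512/256 availability depends on the OpenSSL build. The helper name `chopped_sha512` is made up for illustration.

```python
import hashlib


def chopped_sha512(data: bytes, nbytes: int) -> bytes:
    """Naive truncation: compute plain SHA-512 and keep the first nbytes."""
    return hashlib.sha512(data).digest()[:nbytes]


if __name__ == "__main__":
    data = b"abc"
    naive = chopped_sha512(data, 48)        # SHA-512 chopped to 384 bits
    proper = hashlib.sha384(data).digest()  # standardised truncated variant
    print(naive.hex())
    print(proper.hex())
    print("same value:", naive == proper)
```

Both outputs are 48 bytes long, but because the standardised variant starts from its own initial values, a chopped SHA-512 digest can't be passed off as a SHA-384 digest of the same input.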

Yeah, this is a good point, which we discussed somewhere (not finding the issue atm). The resolution was to treat those as different hash function codes, because the normal usage of truncating hashes by chopping off bits is extremely common. We had direct use cases from past experience trying to help old systems that did things like take a sha2-512, truncate it to 256 bits, and use that instead of sha2-256 (for the speedup on some archs). So we saw the need for _literally_ a different size of the exact same function.

When the functions encourage it, we support the addition of the specific different constructions (if named): https://github.com/multiformats/multihash/blob/master/hashta... -- we did a silly thing with Blake2 where we imported all the valid numbers. (this is suboptimal in table space, but super explicit).

Are there other functions you think we should add for the sha2-256/512 set?

> Multihash's design begs implementations to validate these things by allowing the sender to arbitrarily truncate them

If the sender is manipulating the hash you get (i.e. changing the length prefix counts), you're already in huge trouble. They could change the code and the value too. The threat model here is that the hash you have cannot be altered by the attacker. If the attacker manages to truncate a stream to get you to think it's a shorter hash, the attack fails as you have the length to tell you what you should be expecting. (again, the crux here is that if the attacker can change the length bits they can probably also change the function and own you anyway)

Also-- as noted in https://github.com/multiformats/multihash/issues/70 -- we should make implementations allow clients to lock the hash function and length combinations they want to use, so that attackers cannot manipulate those parameters.
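Roughly what that locking could look like on the client side, as a sketch only: `ALLOWED` and `verify_locked` are hypothetical names, not something from a multihash implementation, and for brevity this assumes the code and length each fit in a single varint byte (values below 0x80).

```python
import hashlib
import hmac

# Hypothetical client policy: the only (function code, digest length) pairs accepted.
ALLOWED = frozenset({(0x12, 32), (0x13, 64)})  # full sha2-256, full sha2-512

# Functions this sketch knows how to recompute, keyed by multihash code.
HASHERS = {0x12: hashlib.sha256, 0x13: hashlib.sha512}


def verify_locked(multihash: bytes, data: bytes, allowed=ALLOWED) -> bool:
    """Accept a multihash only if its code/length pair is allowlisted and it matches data."""
    if len(multihash) < 2:
        return False
    code, length = multihash[0], multihash[1]
    if code >= 0x80 or length >= 0x80:
        return False  # multi-byte varints are not handled in this sketch
    if (code, length) not in allowed:
        return False  # unexpected function or truncated length: reject outright
    digest = multihash[2:]
    if len(digest) != length:
        return False  # stream was cut off, or carries trailing bytes
    hasher = HASHERS.get(code)
    if hasher is None:
        return False
    expected = hasher(data).digest()[:length]
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    good = bytes([0x12, 32]) + hashlib.sha256(b"hi").digest()
    weak = bytes([0x12, 4]) + hashlib.sha256(b"hi").digest()[:4]
    print(verify_locked(good, b"hi"), verify_locked(weak, b"hi"))
```

The 4-byte truncation is internally consistent and would decode fine, but it is rejected because the client never agreed to that function/length combination.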

You could be a little bit opinionated and limit the choice of formats to a small set of ones that are a good idea. That would encourage good design and it would allow you to shrink the metadata. Perhaps one byte would suffice. It's not necessary to support every algorithm/length under the sun.
I'm thinking that it might be useful to include a magic number, so that you can distinguish a Multihash from a plain hash.