Hacker News new | ask | show | jobs
by _prometheus 3395 days ago
Hey everyone. The SHAttered break made us release a new website for Multiformats: http://multiformats.io which includes a page explaining Multihash -- http://multiformats.io/multihash --

The page goes through a bunch of examples of how a multihash would work in practice, why it is useful to do this, and goes through some FAQs.

I'm writing a post about all this that I'll push out soon, but wanted to drop the site here now.

Please, as you make new systems or start to upgrade your systems out of SHA1 (or MD5!!!), please use Multihash so that your upgrades will be much, much easier in the future. This makes rolling up a matter of recompiling some tools, and by far not facing the horrible breakages that occur when tools and scripts assume certain things (like the hash digest of git will always be 160bits... now git faces breaking a lot of things and may have to stick with 160bits of a truncated SHA3 or BLAKE2...)

Oh also-- definitely check out BLAKE2, it's a fantastic hash function from Zooko, JP Aumasson, Samuel Neves, and Christian Winnerlein -- https://blake2.net -- use it for all your hashing needs! So much faster than SHA2, and has the same "unbrokenness" that SHA3 enjoys. (And, I believe deep cryptanalysis has gone into BLAKE, Keccak is not particularly safer)

The Multiformats site also has a Multiaddr description, but that's far from complete and doesn't have the examples. The other multiformats arent well explained on the website yet. Sorry, coming soon :) we wanted to get this out ASAP -- PR's accepted at https://github.com/multiformats/website (oh and yes, TLS certs coming too)

3 comments

How would multihash handle parameterized algorithms like siphash, which take multiple arbitrary parameters? Adding every combination of "number of rounds" and "number of finalization rounds" as a separate table entry seems problematic.
As long as the algorithm is sufficiently described, the parameters can be part of the body. So instead of `<hash-type><hash-result>` it would be `<hash-type><hash-rounds><hash-result>`.
@hobofan Yeah, exactly. We've explored this by making the value of such hashes (when generated and checked with multihash libs) literally be the `<hash-rounds><hash-result>` pair. This avoids complicating the table or multihash model/implementations. We can just redefine the hash function verify to use the first bits as the round, etc. We haven't done this yet, but we may.
See https://news.ycombinator.com/item?id=13740369 -- does that make sense? the idea is to treat those parameters as "part of the hash digest value", and write wrappers for siphash and functions like it, s.t.:

  mh_siphash_digest(input, length, rounds) {
    digest := siphash_digest(input, length, rounds)
    mhdigest := concat(rounds, digest)
    return mhdigest
  }

  mh_siphash_verify(input, mhdigest) {
    mh := multihash.parse(mhdigest)
    rounds := mh.digest[0]    # the rounds
    expected := mh.digest[1:] # rest
    actual := siphash_digest(input, mh.length, rounds)
    return expected == actual
  }
Makes perfect sense, thanks.
It seems like it might be better to put the metadata at the end. That would make it easier to truncate to a certain amount of entropy. Also, it would make it possible to trim off the metadata without having to understand the format to figure out the variable length.
What are you going to do with the hash after you trim off the metadata if you don't understand the format?
I've dealt with variable-length headers on (binary) things before, and it creates annoyances. It screws up the memory alignment, for instance.
I have two minor, but useful, uses:

One is that the prefix is probably unique for small sets. So it's both easy to read and parse and then use just a prefix for doing something like stopping a docker container.

Second is that a short prefix on a larger set of data is probably not unique but at a rate I can predict. I can take a random, and repeatable, sample of data by selecting everything with the same prefix. This gives me a fast way of taking a random sample from a database (or file or anything else).

Good question. There are instances where it allows you to write cleaner (and faster) code if you can identify the parts without necessarily having to parse the metadata.
You could probably omit the length, because the length is probably already known implicitly. E.g. git knows the length of its hashes without having to read the hash. If the length < the full hash, it can be assumed to be truncated.
Some hashing algorithms have a configurable output length. You need to encode the length, or have it as part of the type, and it's more uniform to just have it separate from the type.
I guess it depends on how precious space is. A naked hash is very information-dense. In certain applications, inserting a prefix of several bytes to each hash makes a difference. OTOH, if the hashes end up being inserted into a table in MySQL, then space is probably not that precious.
Attackers can exploit that by cutting off a stream. it's best to be explicit.
Hashes don't currently know their own lengths, so I don't see why they'd need to.
Putting this kind of metadata inline with your hashes is the entire point of multihash:

> Multihash is a protocol for differentiating outputs from various well-established hash functions, addressing size + encoding considerations. It is useful to write applications that future-proof their use of hashes, and allow multiple hash functions to coexist.[0]

[0]: http://multiformats.io/multihash/

In case of multihash length field allows for two things, 1. truncation; 2. you don't have to know the hashing function to transparently pass it through buffers and/or compare.
Danger: truncation shouldn't be done by just chopping off the end of a hash; they should be different hash functions with entirely different images.

Compare, for instance, multihash's treatment of SHA512 truncated to 256-bits, and the standardised hash function SHA512/256. The paper which introduced SHA512/256 even gives a generic way to safely truncate SHA512: https://eprint.iacr.org/2010/548.pdf section 5.

Multihash's design begs implementations to validate these things by allowing the sender to arbitrarily truncate them. That's bad.

Yeah this is a good point, which we discussed somewhere (not finding the issue atm). The resolution was to treat those as different hash function codes, because the normal usage of truncating hashes by chopping off bits is extremely common. We had direct use cases from past experience trying to help old systems that did things like take a sha2-512, trunc to 256bits and use instead of sha2-256 (for the speedup on some archs). So we saw the need for _literally_ a different size of the exact same function.

When the functions encourage it, we support the addition of the specific different constructions (if named): https://github.com/multiformats/multihash/blob/master/hashta... -- we did a silly thing with Blake2 where we imported all the valid numbers. (this is suboptimal in table space, but super explicit).

Are there other functions you think we should add for the sha2-256/512 set?

> Multihash's design begs implementations to validate these things by allowing the sender to arbitrarily truncate them

If the sender is manipulating the hash you get (i.e. changing the length prefix counts), you're already in huge trouble. They could change the code and the value too. The threat model here is that the hash you have cannot be altered by the attacker. If the attacker manages to truncate a stream to get you to think it's a shorter hash, the attack fails as you have the length to tell you what you should be expecting. (again, the crux here is that if the attacker can change the length bits they can probably also change the function and own you anyway)

Also-- as noted in https://github.com/multiformats/multihash/issues/70 -- we should make implementations allow clients to lock the hash function and length combinations they want to use, so that attackers cannot manipulate those parameters.

You could be a little bit opinionated and limit the choice of formats to a small set of ones that are a good idea. That would encourage good design and it would allow you to shrink the metadata. Perhaps one byte would suffice. It's not necessary to support every algorithm/length under the sun.
I'm thinking that it might be useful to include a magic number, so that you can distinguish a Multihash from a plain hash.