by api 1787 days ago
Security-wise they are roughly equivalent. While SHA has had more eyes on it I doubt either construction will ever be practically broken at hash sizes like 384 or 512 bits. Someone may find an "academic break" at some point.

BLAKE3 is faster, sometimes a lot faster, on hardware without SHA instructions. On hardware with SHA instructions SHA may be faster. Same as the AES story where AES is faster than ChaCha on CPUs with dedicated hardware but slower otherwise.


According to zooko, one of the authors, in new-ish cpus blake3 beats sha256 even with hardware acceleration: https://twitter.com/zooko/status/1419403567320821760
I'd like to see the benchmarks, including power draw. I suspect it is similar to soft ChaCha vs hard AES. A ChaCha software implementation can achieve speed similar to that of the AES hardware, at the cost of significantly higher power draw due to pushing the AVX units at near maximum utilization.
I benchmarked hardware AES vs software ChaCha20, and hardware AES improved the overall performance of an end-to-end QUIC software stack by more than 50%. The pure crypto difference is probably even higher. That's a huge gap, even though it's entirely possible that the ChaCha20 implementation in Ring could still be improved.

As a result of that, I asked for rustls to default to AES instead of the previous ChaCha20 default [1]

[1] https://github.com/ctz/rustls/issues/509

Some implementations (IIRC Firefox is one of them) choose it dynamically: if hardware AES is available, AES is ordered before ChaCha20, otherwise ChaCha20 is ordered first. Last time I looked, at least for Intel the i3 didn't have hardware AES (but the i5 and i7 have it), so it was not uncommon for lower-end hardware (which needs speed the most) to have AES slower than ChaCha20.
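
The dynamic ordering described above can be sketched roughly as follows. This is only an illustration: `has_hardware_aes` and `order_cipher_suites` are hypothetical helpers, not any browser's or TLS library's actual API, and the detection assumes Linux-style `/proc/cpuinfo` text where the CPU advertises an `aes` flag.

```python
def has_hardware_aes(cpuinfo_text):
    """Return True if Linux-style cpuinfo text lists an 'aes' CPU flag.

    Hypothetical helper: x86 kernels report a 'flags' line and ARM kernels
    a 'Features' line; other platforms need a different check.
    """
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() in ("flags", "features"):
            if "aes" in value.split():
                return True
    return False


def order_cipher_suites(hardware_aes):
    """Put AES-GCM first when the CPU accelerates AES, ChaCha20 first otherwise."""
    aes = ["TLS_AES_128_GCM_SHA256", "TLS_AES_256_GCM_SHA384"]
    chacha = ["TLS_CHACHA20_POLY1305_SHA256"]
    return aes + chacha if hardware_aes else chacha + aes


if __name__ == "__main__":
    sample = "processor : 0\nflags : fpu sse2 aes avx2\n"
    print(order_cipher_suites(has_hardware_aes(sample)))
```

On a real Linux machine you would pass the contents of `/proc/cpuinfo` instead of the sample string.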
It should be noted that ChaCha20 has an insanely comfortable security margin. By a more accurate estimate, ChaCha8 has a security margin similar to that of AES, and ChaCha20 has 2.5 times as many rounds, so judge for yourself whether that's worth the cost.
Unfortunately almost no one can use ChaCha8 since the standards all call for ChaCha20.
And if you're designing hardware (or on an FPGA) ChaCha (and BLAKE2/3) are actually really freakin' fast for the die area they take. IIRC ChaCha20 beats AES in hardware. It's just extremely rare (FPGA only) to find it in hardware, since there's not enough demand. I expect that to eventually change.
Given that all these are ARX cores, I wonder if a fused ARX instruction could cover a wide range of them?
ARMv8.2 has rotate-and-xor and xor-and-rotate so that the (extremely cheap) xor can be saved.
Probably. With a barrel shifter easily. Barrel shifters are a bit slower than wired shifts though, so for ultimate speed you'd end up with hardware shift amounts.
The advantage would be a single instruction that implemented the core of a wide range of things including BLAKE2, BLAKE3, Salsa, ChaCha, Speck, and more.
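
To make the "fused ARX" idea concrete, here is a rough Python sketch of the ChaCha quarter round written as four calls to a single add-xor-rotate step. The `arx` helper is a hypothetical stand-in for what such an instruction might compute; it is not an existing instruction on any CPU. The rotation amounts (16, 12, 8, 7) are ChaCha's; other ARX designs would use their own.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def arx(a, b, d, rot):
    """One add-xor-rotate step: a += b; d ^= a; d <<<= rot."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, rot)
    return a, d


def chacha_quarter_round(a, b, c, d):
    """ChaCha quarter round expressed as four ARX steps."""
    a, d = arx(a, b, d, 16)
    c, b = arx(c, d, b, 12)
    a, d = arx(a, b, d, 8)
    c, b = arx(c, d, b, 7)
    return a, b, c, d


if __name__ == "__main__":
    out = chacha_quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
    print([hex(word) for word in out])
```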
One aspect of security is also misuse resistance. You can of course create a secure MAC with SHA256 in the HMAC configuration, but it usually takes a master's-level course on cryptography to know what the Merkle-Damgård construction is, and why its design is imperfect:

You can't just do SHA256(key + message) to generate a safe MAC. With BLAKE (and all SHA3 finalists) you can do that safely.
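
For reference, a minimal sketch of the two safe options using only Python's standard library. BLAKE3 is not in `hashlib`, so BLAKE2b's built-in keyed mode stands in for it here; that substitution is an assumption of this example, and the function names are made up for illustration.

```python
import hashlib
import hmac


def mac_hmac_sha256(key, message):
    """SHA-256 used safely as a MAC via the HMAC construction."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def mac_blake2b(key, message):
    """BLAKE2b in keyed mode (key up to 64 bytes), no HMAC wrapper needed."""
    return hashlib.blake2b(message, key=key, digest_size=32).hexdigest()


def tags_match(expected_hex, actual_hex):
    """Constant-time comparison, so verification does not leak timing."""
    return hmac.compare_digest(expected_hex, actual_hex)


if __name__ == "__main__":
    key = b"0123456789abcdef0123456789abcdef"
    print(mac_hmac_sha256(key, b"hello"))
    print(mac_blake2b(key, b"hello"))
```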

It's true that every time you make an algorithm more misuse resistant, the universe will come up with a bigger Dunning-Kruger case. Despite that, misuse resistance is something that can actually be improved, whereas the raw security is already more than adequate. As Schneier so eloquently put it, "we're building a fence for sheep, it doesn't matter if the fence pole is a mile or two miles high".

Good point. Misuse resistance is also why I am a fan of SIV constructions for stream ciphers, since "repeat a nonce = instant death" is a footgun.

Repeating a nonce is easier than you might think if you are using threads and accessing a nonce counter non-atomically, have a bad RNG, are on an embedded platform with bad RNG seeding, have a bug that overwrites some memory used to generate nonces, or just transfer a ton of data with the same key (birthday attack). SIV makes nonce reuse fairly benign. The only consequence is that if you happen to reuse a nonce with two identical messages, an attacker could tell that you sent the same message twice. That's generally not catastrophic and statistically is far less likely than repeating a nonce with different messages. Repeating a nonce with different messages generally does nothing in SIV.

You could theoretically use SIV with no nonce, with the only consequence being that an attacker could always tell if you sent duplicate messages. Not sure why you'd do that though.
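
A toy sketch of the SIV idea, to show why a repeated nonce is fairly benign. This is NOT a real SIV mode and must not be used for actual encryption: HMAC-SHA256 stands in for both the MAC and the keystream generator (a real design would use something like AES or ChaCha), and all function names are invented for this example.

```python
import hashlib
import hmac


def _keystream(enc_key, siv, length):
    """Toy keystream: HMAC-SHA256(enc_key, siv || counter), block by block."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = siv + counter.to_bytes(8, "big")
        out += hmac.new(enc_key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def _synthetic_iv(mac_key, nonce, plaintext):
    """The IV is a MAC over the (length-prefixed) nonce and the plaintext."""
    data = len(nonce).to_bytes(8, "big") + nonce + plaintext
    return hmac.new(mac_key, data, hashlib.sha256).digest()[:16]


def siv_encrypt(mac_key, enc_key, nonce, plaintext):
    siv = _synthetic_iv(mac_key, nonce, plaintext)        # pass 1: MAC
    stream = _keystream(enc_key, siv, len(plaintext))     # pass 2: encrypt
    return siv, bytes(p ^ s for p, s in zip(plaintext, stream))


def siv_decrypt(mac_key, enc_key, nonce, siv, ciphertext):
    stream = _keystream(enc_key, siv, len(ciphertext))
    plaintext = bytes(c ^ s for c, s in zip(ciphertext, stream))
    if not hmac.compare_digest(siv, _synthetic_iv(mac_key, nonce, plaintext)):
        raise ValueError("authentication failed")
    return plaintext


if __name__ == "__main__":
    mk, ek, nonce = b"m" * 32, b"e" * 32, b"reused-nonce"
    iv1, _ = siv_encrypt(mk, ek, nonce, b"first message")
    iv2, _ = siv_encrypt(mk, ek, nonce, b"second message")
    print(iv1 != iv2)  # different messages get different IVs despite the reused nonce
```

Because the IV depends on the plaintext, the only thing a repeated nonce reveals is whether two messages were identical.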

IMHO since we now have ciphers that are probably "unbreakable for the foreseeable future" (e.g. AES and ChaCha) we should probably concentrate on creating and popularizing misuse-resistant constructions as much as possible. It's good to remove footguns.

Hadn't really looked into SIV, as I've only written stuff that always generates XChaCha nonces with getrandom, but yeah, I can totally see why the platform etc. could cause issues that lead to nonce reuse. This was a most informative post, thank you so much!
SIV is usually done with AES/GMAC constructions but you could do it with ChaChaPoly just fine.

The big downside is that it requires two passes on encrypt: one to create the MAC and derive the IV and another to encrypt. The overhead for this is small for message/packet based systems though, since after pass one the data will be sitting hot in the processor's L1 cache. Decryption can be done in one pass.

Aren't you supposed to MAC the encrypted data?
> You can't just do SHA256(key + message) to generate a safe MAC.

Can you explain this?

A SHA-256 hash is just a dump of the internal state of the function. If you know the hash, you can keep running the hash function on more data and calculate a new hash for the original data (plus its padding) with new data appended.
What @dagenix said. See e.g. Thomas Pornin's answer here https://crypto.stackexchange.com/a/3979 for more details
If you have the output

    h = SHA-256(k || m1)
you can easily compute a function `F(h, m2)` such that

    SHA-256(k || m1 || m2) = F(h, m2)
allowing you to forge a verifier for `m1 || m2` under `k` for any `m2` you wish without actually knowing `k`.
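
Here is a rough, self-contained sketch of such an `F` in Python. One detail the formula glosses over: the forged message is really `k || m1 || padding || m2`, where `padding` is the standard SHA-256 padding of the original input, and the attacker has to know or guess the length of `k || m1`. The function names are made up for this example, and the round constants are derived from cube roots of the first 64 primes rather than typed out.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF


def _icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def _first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


# Round constants: first 32 fractional bits of the cube roots of the primes.
K = [_icbrt(p << 96) & MASK for p in _first_primes(64)]


def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """SHA-256 compression function applied to one 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[i] + w[i]) & MASK
        big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def md_padding(total_len):
    """SHA-256 padding for a message that is total_len bytes long."""
    zeros = b"\x00" * ((55 - total_len) % 64)
    return b"\x80" + zeros + struct.pack(">Q", total_len * 8)


def extend(digest_hex, original_len, suffix):
    """Given SHA-256(k || m1) and len(k || m1), forge the digest of
    k || m1 || glue || suffix without knowing k. Returns (glue, digest)."""
    state = list(struct.unpack(">8I", bytes.fromhex(digest_hex)))
    glue = md_padding(original_len)
    processed = original_len + len(glue)  # always a multiple of 64
    data = suffix + md_padding(processed + len(suffix))
    for i in range(0, len(data), 64):
        state = compress(state, data[i:i + 64])
    return glue, struct.pack(">8I", *state).hex()


if __name__ == "__main__":
    key, m1, m2 = b"sixteen-byte-key", b"amount=10", b"&amount=1000000"
    tag = hashlib.sha256(key + m1).hexdigest()
    glue, forged = extend(tag, len(key) + len(m1), m2)
    print(forged == hashlib.sha256(key + m1 + glue + m2).hexdigest())
```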
You can with the truncated versions, though.
single-threaded blake3 is about 1.9x faster than (hardware) sha256 on my Zen 2 CPU.
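
If you want to get a rough number on your own machine, a minimal timing sketch with the standard library looks like this. BLAKE3 is not in `hashlib`, so BLAKE2b is timed as a stand-in (swap in a BLAKE3 binding if you have one installed). Whether `hashlib`'s SHA-256 uses the CPU's SHA instructions depends on how your Python and its crypto backend were built. `throughput_mb_s` is a made-up helper name, and this measures nothing about power draw.

```python
import hashlib
import time


def throughput_mb_s(hash_name, data, repeats=5):
    """Best-of-N single-threaded throughput in MB/s for a hashlib algorithm."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(hash_name, data).digest()
        best = min(best, time.perf_counter() - start)
    return len(data) / best / 1e6


if __name__ == "__main__":
    payload = b"\x00" * (64 * 1024 * 1024)
    for name in ("sha256", "blake2b"):
        print(name, round(throughput_mb_s(name, payload)), "MB/s")
```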