Hacker News new | ask | show | jobs
by eternalban 3453 days ago
Great stuff, thanks. (Just realized who you are! Like your Go work.)

Not sure about the SIMD angle & Java. Ref. impl. of Blake2B [in C] doesn't use it [last I looked at it]. I ~think it has to do with the endian bias of the algo.

1 comments

(Thanks!:)

There are a few optimized implementation in the BLAKE2 code package — https://github.com/BLAKE2/BLAKE2. "Ref" indeed doesn't use SIMD, but other implementations do use them. I think the claim that it's faster than MD5 is based on benchmarking SIMD implementations, and if I remember correctly, the reference one is a bit slower.

Yes, endianness conversion — that is, just reading uint64 or uint32 from byte arrays — if done in the most simple way by bit shifting can be a factor. Another factor that influenced performance of pure Go implementation (should be similar to Java) is bounds checking.

"BLAKE2 comes in two main flavors: BLAKE2b is optimized for 64-bit platforms, and BLAKE2s for smaller architectures. On 64-bit platforms, BLAKE2 is often faster than MD5, yet provides security similar to that of SHA-3. We specify parallel versions BLAKE2bp and BLAKE2sp that are up to 4 and 8 times faster, by taking advantage of SIMD and/or multiple cores."

Well, it's supposed to beat MD5 straight up. Granted the claim is a bit hedged.

> if done in the most simple way by bit shifting can be a factor

Do you mean using byte swap instructions or is there an algorithm that I should know about? :)

Parallel versions of BLAKE2 are indeed even faster, but optimized implementations of non-parallel BLAKE2s and BLAKE2b also use SIMD instructions in compression function (e.g. https://github.com/BLAKE2/BLAKE2/blob/master/sse/blake2b-rou...).

* * *

Not byte swapping, just simple reading bytes and shifting them into the correct position in uint32 like this:

   uint32 result = (b[3] << 24) | (b[2] << 16)  | (b[1] <<  8) | b[0];
Which is slower than

   uint32 result = *b;
or if LE is not native:

   uint32 result = swap(*b);
> optimized implementations of non-parallel BLAKE2s and BLAKE2b also use SIMD

Sure, have seen those. No biggie, but my point was that vanila B2b (on C) is already neck and neck with MD5.

> uint32 result = (b[3] << 24) | (b[2] << 16) | (b[1] << 8) | b[0];

Yep, akaik that's all you can do on the JVM & of course for 64bit it's even more work. I thought you meant some sort of magic number bit wizzardry :)