It is not directly in the article, but in a link to a tweet by djb, the creator of ChaCha8. He believes that the cpb listed in the Randen comparison is off:
He mentions that perhaps the implementation of ChaCha8 for the benchmark is done by hand and unoptimized. And it is true from what I saw that a lot of benchmarks with ChaCha8 are implemented with none of the tweaks that make it fast.
https://twitter.com/hashbreaker/status/1023965175219728386
He mentions that perhaps the implementation of ChaCha8 for the benchmark is done by hand and unoptimized. And it is true from what I saw that a lot of benchmarks with ChaCha8 are implemented with none of the tweaks that make it fast.
In this instance, it looks like the Randen author didn’t reimplement it from scratch, but they used an SSE implementation, not an AVX2 one, which would have been faster: https://github.com/google/randen/blob/1365a91bafc04ba491ce79...