|
|
|
|
|
by pclmulqdq
795 days ago
|
|
That's not quite a correct benchmark if I'm understanding what you benchmarked. Xoshiro and PCG need a chain of serial ops to generate a number, but not a lot of parallelism, letting the core do other stuff while it is working. ChaCha needs a lot of parallelism, and you are essentially saturating that core running it. |
|
I also ran it on x86 (AMD Ryzen 5600X) and got similar results; everything ran faster on a 4.6 GHz chip, but ChaCha8 stayed around 2 cpb while PCG improved a little to about 0.7 cpb.
I do have a ~10-year-old Celeron NUC laying around that I could use to test older/lower-end hardware. I can also publish the (admittedly very simple) program I'm using.
It's also worth noting that cpb is still a measure of average throughput not variance or latency and while Go's buffered ChaCha8 implementation amortizes to 2 cpb, when the 32-element buffer exhausts, generating a new number is much more expensive than the previous 31 numbers. I wouldn't say the performance is good enough for every use case.