HN Mirror


by thoughtpolice 3999 days ago

I think the horsepower of those machines shouldn't be underestimated, because they aren't as equivalent as you might think. I was thoroughly surprised when an unoptimized (but correct!) ChaCha20/8 implementation I wrote ran about as fast on a 3.0GHz POWER8 little-endian machine as AES-256 with AESNI does on the latest 3.5GHz Xeons (about 1.3cpb vs 1.0cpb IIRC, and the latter has a dedicated hardware unit for it!). On that same Xeon, the ChaCha20 code only hit somewhere around 5cpb - that's software vs silicon!
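To put the cycles-per-byte (cpb) figures in throughput terms, here's a quick back-of-the-envelope sketch. It assumes single-core throughput is simply clock rate divided by cpb, and the function name is made up for illustration. The inputs are the rough IIRC numbers from this comment, not fresh measurements.

```python
def throughput_gb_per_s(clock_ghz: float, cycles_per_byte: float) -> float:
    """Single-core throughput in GB/s, assuming throughput = clock / cpb."""
    # GHz is 1e9 cycles/s; dividing by cycles/byte gives 1e9 bytes/s, i.e. GB/s.
    return clock_ghz / cycles_per_byte


# Approximate figures quoted in the comment (not new measurements).
runs = {
    "POWER8 3.0GHz, ChaCha (software)": (3.0, 1.3),
    "Xeon 3.5GHz, AES-256 (AESNI)": (3.5, 1.0),
    "Xeon 3.5GHz, ChaCha (software)": (3.5, 5.0),
}

for name, (ghz, cpb) in runs.items():
    print(f"{name}: {throughput_gb_per_s(ghz, cpb):.2f} GB/s")
```

So "about as fast" here is in per-cycle terms; the Xeon's higher clock still gives it an edge in raw bytes per second under this simple model.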

It also has 170 cores, and it was actually a QEMU instance (w/ hardware virtualization extensions) up against raw dedicated metal. If you're doing any kind of numerical or analytic workload (even databases), I wouldn't throw them aside so quickly. You can even get CUDA for them these days, and certain physical add-ons like CAPI allow you to map and coherently share physical CPU address space with FPGAs or GPUs. If I could get those things in a reasonable workstation configuration, I'd probably go for it tbh.

(I'd be more than willing to repeat this and post some more accurate numbers if anyone cares. I also need to get around to benchmarking AESNI vs that POWER8 machine's _actual_ dedicated AES unit. The benchmark above was only flexing its vector/integer unit capabilities. ;)
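For anyone wondering why this only exercises the vector/integer units: the core of ChaCha is nothing but 32-bit adds, XORs and rotates. Below is a minimal sketch of the quarter-round as specified in RFC 7539. It is not the implementation that was benchmarked, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a: int, b: int, c: int, d: int):
    """ChaCha quarter-round: only add, xor and rotate on 32-bit words."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Quarter-round test vector from RFC 7539, section 2.1.1.
result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
assert result == (0xEA2A92F4, 0xCB1CF8CE, 0x4581472E, 0x5881C4BB)
```

A full block function applies this to the columns and diagonals of a 4x4 state of 32-bit words, which is why it maps so naturally onto SIMD lanes.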

2 comments

ajross 3999 days ago

If you're getting a 4x difference in IPC using a crypto microbenchmark from compiled C code (i.e. it doesn't sound like you're bandwidth or I/O limited), there has to be something else at work. POWER8 is a nice core, but it's not that wide. Maybe the compiler was recognizing your operations and replacing them with AES primitives?
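For reference, the gap can be worked out from the parent's rough numbers. This is just arithmetic on the quoted cpb figures (variable names are illustrative). Since cpb is already normalized by clock speed, the ratio is a per-cycle comparison, which only maps onto IPC if both builds execute roughly the same number of instructions.

```python
# Rough ChaCha figures quoted in the parent comment (cycles per byte).
xeon_chacha_cpb = 5.0
power8_chacha_cpb = 1.3

# Lower cpb is better, so this is how many times more cycles the Xeon spent per byte.
ratio = xeon_chacha_cpb / power8_chacha_cpb
print(f"{ratio:.1f}x")  # prints "3.8x"
```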


rdtsc 3999 days ago

Caches and memory latency/bandwidth can have serious effects as well.


ajross 3999 days ago

Yes, but at this kind of multiplier only in the case where the entire test is 100% cache-resident on one CPU and spilling on the other. Crypto stuff tends to have small working sets, so my intuition is that it's got to be something else.


throwaway2048 3999 days ago

An ASM-optimized ChaCha20 is faster than AES-NI on newer Intel chips.
