| - Have you ever, e.g., computed the sine of a floating-point number in C (sinf)? - Have you ever multiplied a matrix with a vector, or a matrix with a matrix (GEMM), using BLAS? - Have you ever done an FFT? - Have you used C++ barriers? Or pthreads? Or mutexes? An optimized implementation achieves ~100% of the theoretical peak performance of a CPU on all of those, and these are all tailored to each CPU model. There is software on any running system doing those things all the time.
Running at 0% of the peak just means increased power consumption, latency, time to finish, etc. Generic versions perform at < 100%, often at ~0% (0.1%, 0.001%, etc.) of theoretical peak. Somebody has to write software for doing these things on the actual hardware, so that you can then call them from Python. IBM has dozens of "open source" bounties open for PowerPC, and they pay real $$$, but nobody implements them. --- Porting software to PowerPC is only as simple as running make if the libraries your software uses (the C standard library, the libm library, BLAS, etc.) all have optimized implementations, which isn't the case. So when considering PowerPC, you have to divide the paper numbers by 100 if you want the actual numbers that normal code recompiled with make gets in practice. And then you have to invest extra $$$ into improving that software, because nobody will do it for you. |
While it isn't necessarily clear what peak performance means, MKL or OpenBLAS, for instance, reaches only about 90% of serial peak on large DGEMM, which is "~100%" only for a generous value of 100; ESSL is similar. I haven't measured GEMV (ultimately memory-bound), but I got ~75% of hand-optimized DGEMM performance on Haswell with pure C, and I'd expect similar on POWER if I measured. The quoted figures (0.1%, 0.001% of peak) are off by orders of magnitude, even for, say, reference BLAS. I don't know why I need Python, but the software clearly exists -- all those things and more (like vectorized libm). You can even compile assorted x86 intrinsics on POWER; I don't know how well they perform relative to equivalent x86, but I think you're typically better off with an optimizing compiler anyway.
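To make "percent of serial peak" concrete, here is a rough sketch of the arithmetic, using one common definition of peak (clock rate times FMA units times SIMD lanes times 2 flops per FMA). The function names are hypothetical, and the 3.0 GHz clock and 0.37 s timing are illustrative assumptions, not measurements; the two FMA units and four doubles per register correspond to a Haswell-like AVX2 core.

```python
# Sketch: DGEMM performance as a fraction of single-core theoretical peak.
# All names here are illustrative; the clock and timing are assumed values.

def peak_gflops(clock_ghz, fma_units, simd_doubles):
    """Serial double-precision peak in GFLOP/s (one FMA = 2 flops per lane)."""
    return clock_ghz * fma_units * simd_doubles * 2


def dgemm_flops(m, n, k):
    """Flop count for C = A * B with A of size m x k and B of size k x n."""
    return 2 * m * n * k


def percent_of_peak(m, n, k, seconds, peak):
    """Achieved GFLOP/s for one DGEMM call, as a percentage of `peak`."""
    achieved = dgemm_flops(m, n, k) / seconds / 1e9
    return 100.0 * achieved / peak


# Assumed Haswell-like core: 3.0 GHz, 2 FMA units, 4 doubles per AVX2 register.
peak = peak_gflops(3.0, 2, 4)

# Assumed wall-clock time for a 2000 x 2000 x 2000 DGEMM on one core.
share = percent_of_peak(2000, 2000, 2000, 0.37, peak)
print(f"peak = {peak:.1f} GFLOP/s, achieved = {share:.1f}% of peak")
```

Which clock rate goes into the formula (base, turbo, or the lower frequency some cores run at under wide vector load) changes the denominator noticeably, which is one reason the meaning of "peak" isn't clear-cut.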
I've packaged a lot of HPC/research software, which is almost all available for ppc64le; the only things missing are dmtcp, proot, and libxsmm (if libsmm isn't good enough).