by inkyoto 1477 days ago
Low-level runtime optimisation that yields substantial performance gains in user-facing or system-level software, ranging from cryptography through data processing algorithms to very high-throughput JSON parsing.

Take OpenSSL as an isolated example. Simply adjusting the C compiler flags to allow it to use NEON on M1 speeds up the SHA-256 calculation 4x for 128 and 256 block sizes, with the gains quickly tapering off for larger block sizes and ending up at a modest 10% increase. And that performance increase happens without the hash functions having been manually optimised for NEON/SVE1.
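For a feel of what "flags only, no hand-written NEON" means, here is a minimal sketch of the kind of loop a compiler can vectorise by itself. It is not OpenSSL's code. It only applies the small sigma-0 mixing function from the SHA-256 message schedule to an array of independent words; `sigma0_many` is a made-up name, and whether the loop actually gets vectorised depends on the compiler, the optimisation level and the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit word right by r bits (valid for 0 < r < 32). */
static inline uint32_t rotr32(uint32_t x, unsigned r) {
    return (x >> r) | (x << (32u - r));
}

/* SHA-256 small sigma-0: ROTR7(x) ^ ROTR18(x) ^ SHR3(x). */
static inline uint32_t small_sigma0(uint32_t x) {
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

/* Hypothetical helper: every iteration is independent of the others,
 * so the compiler is free to process several words per instruction. */
void sigma0_many(uint32_t *out, const uint32_t *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = small_sigma0(in[i]);
    }
}
```

The source contains no intrinsics, so the same file builds for any target; only the generated code changes with the flags.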

SVE2 with its variable vector size support could improve performance for larger unit sizes. Perhaps it is time to spin up a Graviton3 instance and poke around with clang/gcc to see how good, or how much faster, SVE2 actually is.


Yeah, that's NEON. And there are instructions that literally calculate SHA256, so generalizing from that is moot. My point was, first: what real benchmarks are there of SVE2's benefits over NEON on the mainstream CPUs that M2 would compete against? Unlike AVX-512, NEON was already pretty rich, so the new instructions have rather specialized usefulness.

Because outside of servers (where little cores don't exist), 256b ALUs in big cores mean 256b registers in little cores, and Cortex-A510 is way smaller than Gracemont. And then you're giving Samsung another opportunity to screw up big.LITTLE...

And even the server CPUs with SVE are 2x256b, except A64FX, which is HPC-exclusive, so they are no better than 4x128b.

SVE2 does not increase the maximum speed. That depends only on the width and number of the ALUs, on the number of cores and on the clock frequency.
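As a back-of-envelope sketch of that arithmetic (the function names are made up for illustration, and real throughput also depends on memory bandwidth, instruction mix and so on):

```c
#include <stdint.h>

/* Peak SIMD bits one core can process per cycle:
 * number of vector ALUs times the width of each ALU. */
uint64_t peak_bits_per_cycle(uint64_t alus, uint64_t width_bits) {
    return alus * width_bits;
}

/* Peak bytes per second for the whole chip:
 * per-core bits per cycle, converted to bytes, times cores, times clock. */
double peak_bytes_per_second(uint64_t alus, uint64_t width_bits,
                             uint64_t cores, double clock_hz) {
    return (double)peak_bits_per_cycle(alus, width_bits) / 8.0
           * (double)cores * clock_hz;
}
```

With these definitions, `peak_bits_per_cycle(2, 256)` and `peak_bits_per_cycle(4, 128)` both come out at 512, which is the 2x256b versus 4x128b comparison made above: the instruction set does not appear anywhere in the formula.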

The purpose of SVE2 is to simplify writing software that exploits data parallelism, whether that is done manually or automatically by an autovectorizing compiler.

With SVE2 it should become much easier to deal with data structures whose sizes and alignments are not multiples of the ALU width. It will also no longer be necessary to write many alternative code paths to take advantage of future, better CPUs, as is the case when optimizing for Intel SSE/AVX/AVX2/AVX-512.
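A minimal sketch of what that looks like in practice, using the Arm C Language Extensions intrinsics from `<arm_sve.h>`. The predicated loads and stores used here belong to base SVE, which SVE2 builds on. The function name `add_f32` is made up, and the scalar fallback is there only so the file still compiles when the compiler is not targeting SVE (that is, when `__ARM_FEATURE_SVE` is not defined).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* out[i] = a[i] + b[i] for any n, including n that is not a multiple
 * of the hardware vector length. */
void add_f32(float *out, const float *a, const float *b, size_t n) {
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes per vector on this CPU,
     * so the same binary adapts to 128b, 256b or wider hardware. */
    for (uint64_t i = 0; i < (uint64_t)n; i += svcntw()) {
        /* Predicate is true only for lanes with i + lane < n, so the
         * final partial vector needs no separate scalar tail loop. */
        svbool_t pg = svwhilelt_b32(i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, &a[i]);
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &out[i], svadd_f32_x(pg, va, vb));
    }
#else
    /* Portable fallback for non-SVE targets. */
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

Compare this with the SSE/AVX/AVX-512 situation, where each vector width typically gets its own code path plus a remainder loop for the leftover elements.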

The majority of programs still do not use the existing SIMD units as often as they could. With SVE2, their number should diminish.