|
|
|
|
|
by immibis
337 days ago
|
|
Wait until you learn the Tomasulo algorithm is not a magic wizard box for fast code, but that you can also write code that meshes well with it and get even more performance. Your CPU can run up to 5ish instructions every cycle - are you giving it instructions in suitably matched groups to maximize parallelism? Premature optimization may be the root of all evil, but timely optimization can give you insane benchmark numbers, at least. Not enough people pay attention to the word "premature" and they think it's about all optimization. Sadly, I've never worked on anything where it would have been timely. Haven't even used SIMD. Just watched other people's talks about it. Some general principles are widely applicable: modern processors love to work on big linear SIMD-aligned arrays of data. Hence you have the structure-of-arrays pattern and so on. While a CPU is still the best tool you have for complex branching, you should avoid complex branching as much as possible. In Z80-era architectures it was the reverse - following a pointer or a branch was free, which is why they invented things like linked lists and virtual functions. |
|
As a result, I was handling two 3000x3000 (dense and double precision floating point) matrices and some vectors under ~250MB RAM consumption, and getting in an out of whole integration routine under a minute. Whole evaluation took a bit more than a minute in last decade's hardware.
Perf numbers returned great. ~5IPC, 20% cache trash, completely and efficiently saturated FPU and memory bandwidth.
Result? The PoC was running in 30 minutes, my code was run and done under 2 minutes for most problems. The throughput was 1.7 million complete Gaussian Integrations per second, per core. We have an adaptive approach which takes many partial integrations for a one complete integration [0], so it amounted to a higher number of integrations (IIRC it was ~5 mil/sec/core, but my memory is hazy.).
The code I have written doesn't call for SIMD explicitly, but Eigen [1] most probably does for Matrix routines. Nevertheless, the core "secret sauce" was not the formulae but how it's implemented in a way which minds for the processor's taste. Adding "-march native -mtune native -O3 -G0" gave a 3x performance boost in some steps of the code.
Currently slowly reimplementing this in a tighter form, so let's see how it goes.
So, also considering what I do for a job, I know how nuanced Knuth's quote is, but people are people, and I'm not the one who likes to toot his horn 7/24/365.
I just wanted to share this to confirm, validate and support your claim only, and to continue conversation if you are interested more in this.
[0]: https://journals.tubitak.gov.tr/elektrik/vol29/iss2/45/
[1]: https://eigen.tuxfamily.org/