| Recent LLVM and GCC make good use of avx512, although for GCC you need to pass the option -mprefer-vector-width=512. Unfortunately, benchmarks on websites like Phoronix do not take advantage of this.
But when it comes to numerical computing, it is a boon.
Much easier to take advantage of than the GPU. I run (Monte Carlo) simulations that take hours or days. These can be vectorized, but I've never heard of someone being able to run them on a GPU. However, folks bring graphics cards up every time I mention (my love of) avx512. There is a first time for everything, so I really do want to find the time to play around with a GPU and see how many mid-sized chunks can be woven together, and how memory/cache behavior plays out when breaking things into small pieces. The last Monte Carlo simulation I ran took a few days to get 100 iterations.
The MC iterations themselves were chains of Markov Chain Monte Carlo iterations. Each of these MCMC iterations takes several seconds. Therefore, to move to a GPU, I'd like to parallelize between MC iterations, and also within the MCMC iterations. On a CPU all you have to do is vectorize the MCMC iterations, and then run the chains in parallel. |
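To make the CPU recipe above concrete, here is a minimal C++ sketch of "vectorize inside the MCMC iteration, run the chains in parallel". Everything in it is an assumption for illustration, not the simulation described above: the toy model (data ~ Normal(mu, 1) with a flat prior), the random-walk Metropolis sampler, and the names `log_likelihood`, `run_chain` and `run_chains` are all made up. The OpenMP pragmas are hints; without `-fopenmp` they are ignored and the code simply runs serially.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical toy model: data ~ Normal(mu, 1), flat prior on mu.
// The hot loop is branch-free over contiguous memory, which is the shape
// compilers can turn into SIMD code. The pragma permits reordering the
// floating-point sum when built with -fopenmp or -fopenmp-simd.
double log_likelihood(const std::vector<double>& data, double mu) {
    const double* x = data.data();
    const std::size_t n = data.size();
    double acc = 0.0;
    #pragma omp simd reduction(+:acc)
    for (std::size_t i = 0; i < n; ++i) {
        const double d = x[i] - mu;
        acc += -0.5 * d * d;
    }
    return acc;
}

// One random-walk Metropolis chain; returns the chain's average of mu.
// The chain is inherently sequential, so the SIMD work lives inside
// each log_likelihood call.
double run_chain(const std::vector<double>& data, std::uint64_t seed, int n_iter) {
    std::mt19937_64 rng(seed);  // each chain owns its generator
    std::normal_distribution<double> step(0.0, 0.5);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double mu = 0.0;
    double ll = log_likelihood(data, mu);
    double sum = 0.0;
    for (int it = 0; it < n_iter; ++it) {
        const double prop = mu + step(rng);
        const double ll_prop = log_likelihood(data, prop);
        // Metropolis accept/reject on the log scale.
        if (std::log(unif(rng)) < ll_prop - ll) {
            mu = prop;
            ll = ll_prop;
        }
        sum += mu;
    }
    return sum / n_iter;
}

// Independent chains share no mutable state, so they can run on separate
// threads; each iteration writes only its own slot of `means`.
std::vector<double> run_chains(const std::vector<double>& data, int n_chains, int n_iter) {
    std::vector<double> means(n_chains);
    #pragma omp parallel for
    for (int c = 0; c < n_chains; ++c) {
        means[c] = run_chain(data, 1000u + static_cast<std::uint64_t>(c), n_iter);
    }
    return means;
}
```

Built with something like `g++ -O3 -march=native -mprefer-vector-width=512 -fopenmp` on an AVX-512 machine, a call such as `run_chains(data, 8, 10000)` would spread eight chains across threads. Whether the inner loop really ends up using 512-bit registers is worth confirming with the compiler's vectorization report (for GCC, `-fopt-info-vec`) rather than taking on faith.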
I'm surprised; intuitively (though, mind you, as someone who has never done GPGPU programming, only read articles about it), I'd think some combination of 1. a CPU-RNG-seeded simplex-noise kernel, for per-core randomness; and 2. a cellular-automata kernel embedding of your simulation logic, would let you do MC just fine.