| HN Mirror

by jasode 2750 days ago

>avx-512 (nobody uses)

An example of people using those cpu instructions would be buyers of Intel's proprietary C++/Fortran compiler.[1] The reason companies pay for a ~$1700 license instead of using free compilers such as GCC and Clang is specifically to take advantage of the latest advanced Intel cpu instructions.

Example buyers of Intel's C++ compiler would include high-frequency trading firms and HPC labs. I wouldn't be surprised if Google, Facebook, and Amazon also bought Intel Parallel Studio compiler licenses for some of their workloads.

[1] https://software.intel.com/en-us/parallel-studio-xe/support/...

celrod 2750 days ago

Recent LLVM and GCC make good use of avx512, although for GCC you need to pass the option -mprefer-vector-width=512.

Unfortunately benchmarks on websites like Phoronix do not make use of them. But when it comes to numerical computing, it is a boon. Much easier to take advantage of than the GPU.

I run (Monte Carlo) simulations that take hours or days. These can be vectorized, but I've never heard of someone being able to run them on a GPU.

However, folks bring graphics cards up every time I mention (my love of) avx512. There is always a first, so I do really want to find the time to play around with it, and see how many mid-sized chunks can be woven together. And how memory/cache plays out when breaking things into small pieces.

The last Monte Carlo simulation I ran took a few days to get 100 iterations. The MC iterations themselves were chains of Markov Chain Monte Carlo iterations. Each of these MCMC iterations takes several seconds.

Therefore, to move to a GPU, I'd like to parallelize between MC iterations, and also within the MCMC iterations.

On a CPU all you have to do is vectorize the MCMC iterations, and then run the chains in parallel.

derefr 2750 days ago

> I've never heard of someone being able to run them on a GPU

I'm surprised; intuitively (though, mind you, as someone who has never done GPGPU programming, only read articles about it), I'd think some combination of 1. a CPU-RNG-seeded simplex-noise kernel, for per-core randomness; and 2. a cellular-automata kernel embedding of your simulation logic, would let you do MC just fine.

celrod 2750 days ago

I've often brought up GPUs while talking to folks, because they're interesting and offer a world of potential.

I have two computers, one with a Ryzen 1950X, and the other an i9 7900X. Both CPUs cost about the same, but the i9 (with avx-512) is close to 4 times faster at matrix multiplication. Yet it is still about 10x slower than a cheaper Vega 64 GPU.

But the folks I talk to aren't generally computer scientists. They're statisticians and academics, mostly. A few have tried, but they haven't been successful.

There are libraries like rocRAND / cuRAND for random number generators.

It's probably possible, and I just need to sit down and really experiment. For the MCMC chains (going on within MC), Hamiltonian Monte Carlo sounds more feasible than Gibbs sampling. In Gibbs sampling, you need lots of different conditional random numbers. You often get these from accept/reject algorithms -- i.e., lots of fine-grained control flow. And ideally, each MCMC run has at least an entire work group dedicated to it. You don't want the entire work group calculating a small handful of gamma random numbers (with all the rest masked). The parameters of the gammas are not known in advance, so they cannot be pre-sampled.
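To make the control-flow point concrete, here is a minimal Python sketch of an accept/reject gamma sampler. It assumes the Marsaglia-Tsang rejection scheme (without its optional squeeze step), which is one common choice and not necessarily what any particular Gibbs sampler uses. The helper `lockstep_utilization` and the per-lane shape values are hypothetical, added only to estimate how much work is wasted when every lane has to wait for the slowest one.

```python
import math
import random

def gamma_accept_reject(shape, rng):
    """Draw one Gamma(shape, 1) variate, shape >= 1, by Marsaglia-Tsang rejection.

    Returns (value, attempts); attempts is how many proposals the loop needed.
    The optional "squeeze" shortcut of the published method is omitted.
    """
    d = shape - 1.0 / 3.0
    c = 1.0 / math.sqrt(9.0 * d)
    attempts = 0
    while True:
        attempts += 1
        x = rng.gauss(0.0, 1.0)
        v = (1.0 + c * x) ** 3
        if v <= 0.0:
            continue  # proposal outside the support: reject
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        if math.log(u) < 0.5 * x * x + d - d * v + d * math.log(v):
            return d * v, attempts

def lockstep_utilization(shapes, rng):
    """Fraction of lane-iterations doing useful work if all lanes run in lockstep."""
    attempts = [gamma_accept_reject(a, rng)[1] for a in shapes]
    return sum(attempts) / (len(attempts) * max(attempts))

rng = random.Random(0)
shapes = [1.0 + 0.5 * i for i in range(64)]  # hypothetical per-lane parameters
print(f"lockstep utilization: {lockstep_utilization(shapes, rng):.2f}")
```

The number of trips through the `while` loop depends on the random draws, so lanes that accept early sit idle (masked) until the last lane accepts. A real Gibbs sweep mixes several such samplers, which would likely make the divergence worse than this single-distribution sketch suggests.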

Hamiltonian Monte Carlo is probably much friendlier. However, I have heard concerns that the symplectic integrator used needs a high degree of accuracy to avoid diverging. That is, that it needs 64 bits of precision.
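For reference, the symplectic integrator usually meant here is leapfrog. Below is a minimal Python sketch on a one-dimensional standard normal target, assuming unit mass; the target, the step sizes, and the function names are illustrative choices, not taken from the simulation discussed above. It runs in ordinary 64-bit floats, so it shows the integrator and the energy-error diagnostic people watch for divergence, but it does not by itself settle whether 32 bits would be enough.

```python
def leapfrog(q, p, grad_u, step, n_steps):
    """Leapfrog integration of H(q, p) = U(q) + p*p/2 (unit mass assumed)."""
    p = p - 0.5 * step * grad_u(q)        # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p                  # full step for position
        if i < n_steps - 1:
            p = p - step * grad_u(q)      # full step for momentum
    p = p - 0.5 * step * grad_u(q)        # final half step for momentum
    return q, p

def potential(q):
    return 0.5 * q * q    # negative log density of a standard normal, up to a constant

def grad_potential(q):
    return q

def hamiltonian(q, p):
    return potential(q) + 0.5 * p * p

q0, p0 = 1.0, 0.5
for step in (0.01, 0.1, 1.0, 2.1):
    q1, p1 = leapfrog(q0, p0, grad_potential, step, 200)
    err = hamiltonian(q1, p1) - hamiltonian(q0, p0)
    print(f"step={step}: energy error = {err:.3e}")
```

For this quadratic potential the energy error stays bounded while the step size is below 2 and blows up above it. In HMC the energy error feeds directly into the accept/reject decision, which is why accumulated rounding error in the trajectory is a plausible worry at lower precision.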

GPUs with good 64-bit (double precision) performance are well outside of my budget. Although, I could look into tricks like double-singles for the accuracy-critical parts of the computation.
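A rough sketch of the double-single idea: each value is stored as an unevaluated sum of two 32-bit floats, (hi, lo), and additions use Knuth's error-free two-sum to carry the rounding error along. Python has no native 32-bit float, so the `f32` helper below emulates one by rounding through `struct`; that helper, the function names, and the test sum are assumptions made for illustration. On a GPU the same steps would be plain `float` operations, compiled without fast-math style reassociation, which would optimize the error terms away.

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Knuth's error-free addition in float32: a + b == s + e exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ds_add(x, y):
    """Add two double-single numbers, each a (hi, lo) pair of float32 values."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))  # renormalize so lo holds what hi could not
    return hi, lo

n = 50_000
x = f32(0.1)
naive = 0.0
acc = (0.0, 0.0)
for _ in range(n):
    naive = f32(naive + x)       # plain float32 accumulation
    acc = ds_add(acc, (x, 0.0))  # double-single accumulation

exact = x * n                    # reference computed in 64-bit
naive_err = abs(naive - exact)
ds_err = abs((acc[0] + acc[1]) - exact)
print(f"float32 error: {naive_err:.3e}, double-single error: {ds_err:.3e}")
```

Each double-single add costs roughly ten float32 operations, so this only pays off where 32-bit throughput is far higher than 64-bit throughput and the accuracy-critical part is a small share of the work.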

The simulation I mentioned in my previous comment was using Hamiltonian Monte Carlo. However, each iteration was rather involved, and while much is vectorizable (eg, matrix factorizations and inversions), doing so on a GPU is AFAIK not trivial. It seems like a gigantic leap in complexity.

BeeOnRope 2750 days ago

clang and gcc supported AVX-512 before any chips using it were even available.

icc might still have an edge on vectorized code, but it is not that big.

flamedoge 2750 days ago

Parallel Studio is more than just latest ISA support though. VTune tells you all about performance at the processor level, so you get much more accurate perf profiles from the hardware itself.

gregdunn 2750 days ago

You can also get VTune as part of System Studio through a 90 day perpetually renewable community support license - https://software.intel.com/en-us/system-studio/choose-downlo...

No talking to Intel's engineers, but kind of cool if you're alright going without a real support channel.
