by dragontamer 1789 days ago
The task-based parallelism in LLVM leaves much to be desired, however. Ideally, you'd want a more efficient implementation.

But yeah, it's good enough to play with, though maybe not good enough to achieve high levels of performance. The SIMD stuff is probably simple enough to implement... maybe I should check out how well LLVM works with the OMP SIMD keywords.


Can you comment on your experience (or contact me) regarding implementation efficiency? We have recently implemented task-based parallelism in the J language with OpenMP[0]. Improvements or critiques are appreciated. The SIMD instructions there have been coded directly rather than via pragmas.

[0] https://www.monument.ai/m/parallel

I can't say that my critiques are based on personal experience. They mostly come from microbenchmarks that other people have run and discussed. I am probably a bit out of date, since it's been a while since I last played with OpenMP.

I'm looking at the benchmarks I used to look at, and they're all from 2014 or earlier. So maybe I really should double-check modern implementations. We all know GCC 4.x and LLVM 3.x were an eternity ago, so I probably should revisit their performance.

For example: https://www.phoronix.com/scan.php?page=article&item=llvm_cla...

And back then, it was pretty well known that the OpenMP implementations in GCC and LLVM were slower than commercial ones (such as Intel ICC or IBM's OpenMP implementation).

In LLVM or in libomp? I don't know what omp simd is likely to get you over autovectorization. I know of cases where it was thought necessary (-fopenmp-simd, without -fopenmp) but wasn't with recent GCC.

Autovectorization has issues with function calls.

"#pragma omp declare simd" applies to a function: it tells the compiler to also generate a SIMD version of that function, which then allows the function to be called inside of a "#pragma omp for simd" loop without breaking vectorization.

A few keywords here and there really help the autovectorizer get closer to a CUDA-like environment (like... actually having your SIMD code extend "through" a function call, so you can start splitting up the work a bit better).

EDIT: Here's an example from Intel's ICC: https://software.intel.com/content/www/us/en/develop/documen...
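
To make the function-call point concrete, here is a minimal sketch in C. The names scale_and_offset and transform are made up for illustration, not taken from the ICC documentation. It uses plain "omp simd" on the loop; "omp for simd" is the variant that additionally shares the iterations across threads.

```c
#include <stddef.h>

/* Hypothetical helper. "declare simd" asks the compiler to also emit a
   vectorized variant of this function. uniform(...) says that scale and
   offset are the same for every SIMD lane; notinbranch says the function
   is never called conditionally inside the loop. */
#pragma omp declare simd uniform(scale, offset) notinbranch
float scale_and_offset(float x, float scale, float offset)
{
    return x * scale + offset;
}

/* Applies the helper to every element. "omp simd" asks the compiler to
   vectorize this loop, and the declaration above lets it keep the call
   vectorized instead of falling back to one scalar call per element. */
void transform(float *out, const float *in, size_t n,
               float scale, float offset)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = scale_and_offset(in[i], scale, offset);
    }
}
```

If the pragmas are ignored (no -fopenmp or -fopenmp-simd), this is still ordinary C with the same results; whether the vector variant actually gets used depends on the compiler and flags.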

I took the example program from the OpenMP standard and built it with GCC 11 -Ofast. -fopt-info said the relevant loop was vectorized. Adding -fopenmp gave more vectorization messages from elsewhere, but I don't have time to figure out the difference from the tree dump (not being good with assembler). Doubtless the directives can help, but you do need to get them right, and I trust GCC more than me!
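
For anyone who wants to repeat this kind of comparison, here is a rough sketch. It is a stand-in loop, not the example program from the OpenMP standard, and the file name saxpy.c is hypothetical. The flags in the comment are the GCC ones mentioned in this thread, with -fopt-info-vec narrowing the report to vectorization messages.

```c
#include <stddef.h>

/* Builds to compare (saxpy.c is a made-up file name):
 *   gcc -c -Ofast -fopt-info-vec saxpy.c                 autovectorizer only
 *   gcc -c -Ofast -fopenmp-simd -fopt-info-vec saxpy.c   simd directives only
 *   gcc -c -Ofast -fopenmp -fopt-info-vec saxpy.c        full OpenMP
 */

/* y[i] = a * x[i] + y[i]. "restrict" promises that x and y do not
   overlap, which removes one common obstacle to autovectorization
   even when the pragma is ignored. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

On a loop this simple the autovectorizer may well do the job on its own, which would match the observation above; the directives are more likely to matter once the loop body contains function calls or possible aliasing.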