In LLVM or in libomp?
I don't know what omp simd is likely to get you over autovectorization. I know of cases where it was thought necessary (-fopenmp-simd, without -fopenmp) but turned out not to be with recent GCC.
"#pragma omp declare simd" applies to a function declaration. It asks the compiler to also generate vector versions of that function, which then allows the function to be called from inside a "#pragma omp for simd" loop without breaking vectorization.
A few keywords here and there really help the autovectorizer get closer to a CUDA-like environment (like... actually having your SIMD code extend "through" a function call, so you can start splitting up the work a bit better).
I took the example program from the OpenMP standard and built it with GCC 11 -Ofast. -fopt-info said the relevant loop was vectorized. Adding -fopenmp gave more vectorization messages from elsewhere, but I don't have time to figure out the difference from the tree dump (not being good with assembler). Doubtless the directives can help, but you do need to get them right, and I trust GCC more than me!
EDIT: Here's an example from Intel's ICC: https://software.intel.com/content/www/us/en/develop/documen...