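There is no such thing as 'failure to autovectorize' in CUDA or OpenCL. All code is vectorized in these languages: you write scalar-looking code for a single thread (or work-item), and it runs across many lanes at once. SIMD is fundamental to the programming model.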
As such, it's easier to write high-performance code in practice. Intel's ISPC is the closest tool that replicates this effect for AVX-512.
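To make that concrete, here's a rough CUDA sketch of a SAXPY kernel. The names `saxpy` and `run_saxpy`, the fill values, and the block size of 256 are all just illustrative choices of mine, and error checking is left out to keep it short. The point is that the kernel body is plain per-element code with no loop for a compiler to vectorize (or fail to vectorize); the parallel width comes from the launch configuration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes exactly one element. There is no loop over the data
// here, so there is nothing for an autovectorizer to analyze.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have threads past the end
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host helper: fills x with 1.0 and y with 2.0, runs the kernel,
// and returns the resulting y. Assumes n > 0 and a working CUDA device.
std::vector<float> run_saxpy(int n, float a) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* dx = nullptr;
    float* dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // illustrative block size
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    saxpy<<<blocks, threads>>>(n, a, dx, dy);

    // This copy runs after the kernel on the default stream, so it also
    // acts as the synchronization point.
    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}

int main() {
    std::vector<float> y = run_saxpy(1000, 3.0f);
    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```

Compare that with the CPU version, where you'd write a `for` loop over `i` and then hope the compiler proves it can vectorize it.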