#pragma omp target parallel for
The "target" now makes the for-loop discussed a GPU or FPGA algorithm. What's strange about this is... you seem to have known this already? So I've had difficulty writing an actual response to you.

OpenMP is just one tool in my toolbox. To be honest, I've found it to be not flexible enough for most of my usage, but thanks to its sheer simplicity it is, again, one of the easiest C++ / C tools I've ever used. Yes, even for playing or dabbling in GPGPU programming.

Furthermore, OpenMP is usable on GCC and Clang. It's even available (OpenMP 2.0 at least) on MSVC++ (though OMP 2.0 leaves much to be desired, that's still enough for some degree of programming on Windows). So OpenMP code in, say, Blender (3D raytracing program) runs on pretty much all important C++ platforms.

GPU and FPGA programming is complicated to actually do well, because GPUs and FPGAs sit behind a huge PCIe 3.0 bottleneck. A lot of code in CPU-land can stay in L1, L2, or L3 cache and finish faster than the PCIe transfer alone would take. In contrast, CPU-to-CPU transfers are very quick (they happen at L3-to-L3 or L2-to-L2 transfer speeds), so your "cost of communication" is very low.

I don't want to discourage any beginner from playing with GPU code (especially if they're "just messing around"). GPU code is easier to write than most expect. But it's surprisingly difficult to actually beat CPU code with GPU-offload code. If it's not something that works out for you, that's fine I guess? There are a lot of different tools for a lot of different situations.

--------

A fun OMP thing, btw, is...

#pragma omp parallel for simd
Which (tries to) compile your program into SIMD code, like AVX512 or NEON for ARM.

OMP isn't as flexible as writing your own threads by hand, but it's easy to experiment with many forms of parallelism with the same code. There are also nifty clauses, like "firstprivate", "collapse", or "reduction", as well as different schedulers (static, dynamic, guided).

Honestly, it's really good for prototyping. You write one for-loop, but have all these knobs and dials to try out a bunch of different strategies. But for "final code", hand-crafted threaded code really can't be beaten.

--------

BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.
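To make the "knobs and dials" point concrete, here is a minimal sketch of one loop carrying several of the clauses mentioned above. The function name `dot` and the choice of a dot product are just assumptions for illustration, not anything from a real codebase. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: dot product of two equally sized vectors.
// The single pragma asks for threads (parallel for), vectorization (simd),
// a per-thread partial sum that is combined at the end (reduction),
// and an even up-front split of iterations (schedule(static)).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    const double* pa = a.data();
    const double* pb = b.data();
    double sum = 0.0;

    #pragma omp parallel for simd reduction(+:sum) schedule(static)
    for (long i = 0; i < n; ++i) {
        sum += pa[i] * pb[i];
    }
    return sum;
}
```

Swapping `schedule(static)` for `schedule(dynamic)` or `schedule(guided)`, or dropping `simd`, changes the strategy without touching the loop body, which is what makes this handy for prototyping. Note that the `simd` construct needs OpenMP 4.0 or newer, so this exact pragma would not be accepted by an OpenMP 2.0 implementation.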
The "target" alone does not suffice: you also need to make sure the memory is moved to the GPU or the FPGA, so you have to handle that as well.
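As a rough sketch of what handling the memory can look like, OpenMP's `map` clauses describe which arrays are copied to and from the device around a `target` region. The function name `saxpy` and the raw-pointer setup are assumptions made for this example. Whether the loop really runs on a GPU depends on the compiler being built with offload support; otherwise the region falls back to the host, and without OpenMP enabled the pragma is ignored entirely.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: y[i] = a * x[i] + y[i], written for offload.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // map(to: ...)     copies x to the device before the region starts.
    // map(tofrom: ...) copies y to the device and back to the host afterwards.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Every `map` here is a transfer over the bus, so for a loop this small the copies will usually cost more than the arithmetic; keeping data resident on the device across several regions (for example with a `target data` region) is the usual way to cut that down.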
> BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.
They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.