by dragontamer 1542 days ago

        #pragma omp target parallel for

The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.
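For readers following along, here is a minimal sketch of what such a loop might look like. The function name `saxpy_offload` and the SAXPY workload are illustrative assumptions, not anything from the thread. The `map` clauses spell out which arrays are copied to and from the device. If the compiler has no offload target configured, the region typically falls back to running on the host.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: computes y[i] = a * x[i] + y[i].
// With an offload-capable compiler the loop runs on the device;
// otherwise the target region falls back to the host CPU.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const int n = static_cast<int>(x.size());

    // map(to: ...) copies x to the device before the loop;
    // map(tofrom: ...) copies y in and copies the result back out.
    #pragma omp target parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Built without OpenMP enabled, the pragma is ignored and the same loop runs serially, which is part of why it is cheap to experiment with.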

OpenMP is just one tool in my toolbox. To be honest, I've found it not flexible enough for most of my usage, but thanks to its sheer simplicity it is one of the easiest C++ / C tools I've ever used. Yes, even for playing or dabbling in GPGPU programming.

Furthermore, OpenMP is usable on GCC and Clang. It's even available (OpenMP 2.0 at least) on MSVC++ (though OMP 2.0 leaves much to be desired, that's still enough for some degree of programming on Windows). So OpenMP code in, say, Blender (a 3D raytracing program) runs on pretty much all important C++ platforms.

It is hard to make GPU and FPGA code actually perform well, because GPUs and FPGAs sit behind a huge PCIe 3.0 bottleneck. A lot of code in CPU-land can stay in L1, L2, or L3 cache and finish before the PCIe transfer alone would. In contrast, CPU-to-CPU transfers are very quick (they run at L3-to-L3 or L2-to-L2 transfer speeds), so your "cost of communication" is very low. I don't want to discourage any beginner from playing with GPU code (especially if they're "just messing around"). GPU code is easier to write than most expect.

But it's surprisingly difficult to actually beat CPU code with GPU-offload code.

If it's not something that works out for you, that's fine I guess? There are a lot of different tools for a lot of different situations.

--------

A fun OMP thing btw, is...

        #pragma omp parallel for simd

Which (tries to) compile your program into SIMD code, like AVX512 or NEON for ARM.
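As a rough illustration, a dot product is a typical loop to try this on. The function name `dot` is a made-up example. The `reduction` clause is needed so that each thread and SIMD lane accumulates privately before the partial sums are combined.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: dot product split across threads, with each
// thread's chunk additionally vectorized where the compiler can.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    float sum = 0.0f;

    // reduction(+:sum) gives every thread/lane a private partial sum
    // and adds the partial sums together at the end of the loop.
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Whether the compiler really emits AVX or NEON instructions depends on the target flags and the loop body, so it is worth checking the generated assembly or the vectorization report.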

OMP isn't as flexible as writing your own threads by hand, but it's easy to experiment with many forms of parallelism with the same code. There are also nifty clauses, like "firstprivate", "collapse", or "reduction", as well as different schedulers (static, dynamic, guided).
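A small sketch of how those clauses combine on one loop nest. The function `scaled_sum`, the row-major layout, and the chunk size of 64 are arbitrary choices for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sums scale * m[r][c] over a row-major matrix.
double scaled_sum(const std::vector<double>& m, int rows, int cols, double scale) {
    assert(m.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    double total = 0.0;

    // collapse(2): fuse both loops into a single iteration space.
    // schedule(dynamic, 64): threads grab chunks of 64 iterations on demand.
    // firstprivate(scale): each thread gets its own initialized copy of scale.
    // reduction(+:total): per-thread partial sums are added at the end.
    #pragma omp parallel for collapse(2) schedule(dynamic, 64) \
        firstprivate(scale) reduction(+:total)
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            total += scale * m[r * cols + c];
        }
    }
    return total;
}
```

Swapping `schedule(dynamic, 64)` for `schedule(static)` or `schedule(guided)` changes the work distribution without touching the loop body.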

Honestly, it's really good for prototyping. You write one for-loop, but have all these knobs and dials to try out a bunch of different strategies. But for "final code", hand-crafted threaded code really can't be beaten.

--------

BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.

maxwell86 1542 days ago

> The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.

The target does not suffice, you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.
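One way this explicit data handling can look in OpenMP is a `target data` region, sketched below. The function name `scale_then_shift` is hypothetical. The idea is that the array is copied to the device once, stays there across two offloaded loops, and is copied back once at the end.

```cpp
#include <vector>

// Hypothetical example: two offloaded loops sharing one device copy of v.
void scale_then_shift(std::vector<float>& v, float scale, float shift) {
    float* p = v.data();
    const int n = static_cast<int>(v.size());

    // Copy p[0:n] to the device on entry and back to the host on exit.
    #pragma omp target data map(tofrom: p[0:n])
    {
        // The data is already present on the device, so these map
        // clauses reuse the existing copy instead of transferring again.
        #pragma omp target parallel for map(tofrom: p[0:n])
        for (int i = 0; i < n; ++i) {
            p[i] *= scale;
        }

        #pragma omp target parallel for map(tofrom: p[0:n])
        for (int i = 0; i < n; ++i) {
            p[i] += shift;
        }
    }
}
```

Without the enclosing `target data` region, each loop would do its own round trip over the bus, which is exactly the transfer cost being discussed.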

> BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.

They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.

dragontamer 1542 days ago

> The target does not suffice, you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.

Yes. That's the bottleneck and difficulty of GPU / FPGA programming: knowing where your memory is. PCIe has very high latency, especially compared to L1, L2, L3, or DDR4 RAM.

Look, even in NVidia's "Thrust" library, you want to think very carefully about CPU vs GPU RAM. If your operations are primarily on the CPU-side, you want a CPU-malloc. If your operations are primarily on GPU-side, you want a GPU-malloc.

Modern PCIe can "handle the details" for you, but if your GPU memory accesses all go through the PCIe bus to read CPU-DDR4 RAM to do anything, it will simply be slower than using the CPU itself.

This isn't a beginner subject anymore, not in the slightest.

> They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.

I severely doubt that std::for_each_n exists on GPU code.

I see that for_each_n exists in NVidia's "Thrust" library, which is also a beginner-level API / library to use in the CUDA system (but not as efficient as dedicated GPU code). And I can imagine that NVidia Thrust might be compatible with more recent C++ standards.

But I cannot imagine the underlying API knowing whether to do a CPU-malloc or a GPU-malloc efficiently. And I'm not seeing any std:: API that handles this detail. (NVidia Thrust has the programmer explicitly choose between a device_vector and a host_vector.)

-------

The __ONLY__ API that ever tried to "automagically" figure out the CPU-malloc vs GPU-malloc issue was C++ AMP by Microsoft. It was interesting, but performance issues and DirectX11 compatibility prevented progress (when DX12 GPUs came out, the C++AMP project didn't keep up).

I liked their array_view abstraction and its "automagic" at trying to figure out this memory-management issue. But... I really haven't seen anything like that since C++AMP.

maxwell86 1536 days ago

> I severely doubt that std::for_each_n exists on GPU code.

https://docs.nvidia.com/hpc-sdk/compilers/c++-parallel-algor...

This is 4 years old. Been using it in production for the last 2 years. Works fine.

Pretty much everyone I've talked to using this in production from other research groups was able to remove all their CUDA code and replace it with this without any performance hit.
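For context, the standard C++17 call in question looks roughly like the sketch below. The function name `affine_in_place` is invented for illustration. This is plain ISO C++; whether it runs on a GPU depends entirely on the toolchain (the linked NVIDIA docs describe offloading algorithms called with a parallel execution policy), and with libstdc++ the parallel policies may require linking against TBB.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Hypothetical example: applies x = a * x + b to every element.
// The lambda captures only by value and touches only heap data owned
// by the vector, which keeps it friendly to offloading compilers.
void affine_in_place(std::vector<float>& v, float a, float b) {
    std::for_each_n(std::execution::par_unseq, v.begin(), v.size(),
                    [a, b](float& x) { x = a * x + b; });
}
```

Note that nothing in the call says where the memory lives, which is the point the two commenters disagree about.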

There are some publications about this, though most of them are fairly old by now because this is not new anymore: https://arxiv.org/abs/2010.11751
