by maxwell86
1495 days ago
> The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.
The target does not suffice, you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.
> BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.
They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.
Yes. That's the bottleneck and difficulty of GPU / FPGA programming. Knowing where your memory is. PCIe is very high latency, especially compared to L1, L2, L3, or DDR4 RAM.
Look, even in NVidia's "Thrust" library, you want to very carefully be thinking about CPU vs GPU RAM. If your operations are primarily on the CPU-side, you want a CPU-malloc. If your operations are primarily on GPU-side, you want a GPU-malloc.
Modern PCIe can "handle the details" for you, but if every GPU memory access has to cross the PCIe bus and read CPU-side DDR4 RAM before it can do anything, it will simply be slower than using the CPU itself.
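To put rough arithmetic behind that: a minimal sketch of the trade-off, assuming a simple model where the data is copied to the device once and back once. `estimate_offload` and all of its parameters are hypothetical names for illustration, and the inputs are whatever the caller plugs in, not measurements of any real hardware.

```cpp
// Hypothetical back-of-envelope model: is offloading worth the bus traffic?
// Every input is supplied by the caller; nothing here is a measured value.
struct OffloadEstimate {
    double transfer_seconds;   // time to move the data across the bus, both ways
    double gpu_total_seconds;  // transfer time plus GPU compute time
    bool worth_offloading;     // true when the GPU path beats the CPU path
};

OffloadEstimate estimate_offload(double bytes,
                                 double bus_bytes_per_second,
                                 double bus_latency_seconds,
                                 double cpu_compute_seconds,
                                 double gpu_compute_seconds) {
    // One copy to the device and one copy back, each paying the latency once.
    double transfer = 2.0 * (bus_latency_seconds + bytes / bus_bytes_per_second);
    double gpu_total = transfer + gpu_compute_seconds;
    return {transfer, gpu_total, gpu_total < cpu_compute_seconds};
}
```

With made-up inputs such as `estimate_offload(1e9, 1e10, 1e-5, 0.05, 0.001)`, the transfer term alone is larger than the CPU compute time, so the model says to stay on the CPU even though the GPU kernel itself is assumed to be 50x faster. Keeping the data resident on the GPU across many operations is what amortizes that cost.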
This isn't a beginner subject anymore, not in the slightest.
> They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.
I severely doubt that std::for_each_n exists on GPU code.
I see that for_each_n exists in NVidia's "Thrust" library, which is also a beginner-level API / library to use in the CUDA system (but not as efficient as dedicated GPU code). And I can imagine that NVidia Thrust might be compatible with more recent C++ standards.
But I cannot imagine the underlying API knowing whether to do a CPU-malloc or a GPU-malloc efficiently. And I'm not seeing any std:: API that handles this detail. (NVidia Thrust has the programmer explicitly choose between a device_vector and a host_vector.)
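For reference, here is roughly what the standard C++17 call under discussion looks like on the CPU. This is a minimal sketch: `scale_first_n` is a hypothetical helper, and it uses the plain sequential overload of std::for_each_n. Note that the call takes an iterator, a count, and a function, and nothing else. There is no parameter that says where the memory lives.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: scale the first n elements of a vector and return it.
// std::for_each_n only sees an iterator, a count, and a callable; where the
// underlying memory was allocated is invisible at this level.
std::vector<float> scale_first_n(std::vector<float> data, std::size_t n, float factor) {
    n = std::min(n, data.size());  // never walk past the end of the vector
    std::for_each_n(data.begin(), n, [factor](float& x) { x *= factor; });
    return data;
}
```

Calling `scale_first_n({1.0f, 2.0f, 3.0f, 4.0f}, 2, 10.0f)` scales only the first two elements. Whether an overload that also takes an execution policy ends up on a GPU is up to the toolchain, and the memory placement question above still has to be answered somewhere.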
-------
The __ONLY__ API that ever tried to "automagically" figure out the CPU-malloc vs GPU-malloc issue was C++ AMP by Microsoft. It was interesting, but performance issues and DirectX11 compatibility prevented progress (when DX12 GPUs came out, the C++AMP project didn't keep up).
I liked their array_view abstraction and its "automagic" at trying to figure out this memory-management issue. But... I really haven't seen anything like that since C++AMP.