by maxwell86 1496 days ago
lol please no

If you are using C++, and want to parallelize something, just add "std::execution::par" to your algorithms.

Instead of writing "std::for_each(...)" just write "std::for_each(std::execution::par, ...)".

That's it. It really is that simple. And with the right compilers you can just compile the code to run on FPGAs, GPUs, or whatever.

For someone who knows C++, that is the lowest barrier to entry, and it gets you most of the way there without having to learn "some other programming language" like OpenMP (or anything else).
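
As an illustration of the one-argument change described above, here is a minimal sketch. The helper name `add_vectors_par` is made up for this example, and it assumes a C++17 standard library that actually implements the parallel algorithms (with GCC's libstdc++ that typically means linking against TBB). The policy only permits parallel execution; whether anything is offloaded to a GPU or FPGA depends on the toolchain and is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Hypothetical helper (not from the thread): element-wise sum of two vectors.
std::vector<float> add_vectors_par(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    // Same call as the serial std::transform, with the policy as first argument.
    // The library may split the range across threads, but is not required to.
    std::transform(std::execution::par, a.begin(), a.end(), b.begin(), c.begin(),
                   [](float x, float y) { return x + y; });
    return c;
}
```

Dropping the `std::execution::par` argument gives the ordinary serial call, which is a convenient way to compare results.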


OpenMP can parallelize a for loop like:

    #pragma omp parallel for
    for(int i=0; i<1000000; i++)
        C[i] = A[i] + B[i];

It's normal C++ and the pragma auto-parallelizes the loop. It's actually really easy and convenient.

OpenMP is probably the easiest fork/join model of parallelism on any C++ system I've used. It doesn't always get the best utilization of your CPU cores, but it's really, really simple.

It's the best way to start IMO, far easier than std::thread, or switching to functional style for other libraries. Just write the same code as before but with a few #pragma omp statements here and there.
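
A complete version of that loop might look like the sketch below. The function name and the use of `std::vector` are assumptions added for illustration. The directive is spelled `#pragma omp`, and a compiler without OpenMP enabled (for GCC and Clang, without `-fopenmp`) simply ignores it and runs the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): c[i] = a[i] + b[i] with OpenMP.
std::vector<float> add_vectors_omp(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    // Each iteration writes a distinct element, so no synchronization is needed.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```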

> It's normal C++

C++ is an ISO standard, you can use it _everywhere_, in space, in automotive, in aviation, in trains, in medical devices, _everywhere_.

OpenMP is not C++, it is a different programming language than C++.

OpenMP is not an ISO standard, you can't use it in _most_ domains that you can use C++.

Your example:

    #pragma omp parallel for
    for(int i=0; i<1000000; i++)
        C[i] = A[i] + B[i];

shows how bad OpenMP is.

It does not run in parallel on GPUs or on FPGAs (lacking target offload directives), and you can't use it in most domains in which you can use C++.

The following is ISO standard C++:

    std::for_each_n(std::execution::par, std::views::iota(0).begin(), 1000000, [](int i) {
        C[i] = A[i] + B[i];
    });

It runs in _parallel_ EVERYWHERE: GPUs, CPUs, FPGAs, and it is certified for all domains for which C++ is (that is: all domains).

Show me how to sort an array in parallel on _ANY_ hardware (CPUs, GPUs, FPGAs) with OpenMP. With C++ it is as simple as:

    std::sort(std::execution::par, array.begin(), array.end());

If you have a GPU, this offloads to the GPU. If you have an FPGA, this offloads to the FPGA. If you have a CPU with 200 cores, this uses those 200 cores.
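
For reference, a self-contained version of that call could look like the sketch below. The wrapper name is hypothetical, and the code assumes a standard library with working parallel algorithms. Where the sort actually executes (CPU threads, GPU, or elsewhere) is decided by the compiler and runtime, not by anything visible in this code.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Hypothetical wrapper (not from the thread): returns a sorted copy of the input.
std::vector<int> sorted_copy_par(std::vector<int> values) {
    // Identical to the serial std::sort call apart from the execution policy.
    std::sort(std::execution::par, values.begin(), values.end());
    assert(std::is_sorted(values.begin(), values.end()));
    return values;
}
```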

There is no need to turn your ISO C++ compliant program or libraries into OpenMP. That prevents them from being used in many of the domains in which C++ runs. It also adds an external dependency for parallelism, for no good reason.

For any problem that OpenMP can solve, OpenMP is _always_ a worse solution than just using strictly ISO standard and compliant C++.

OpenMP has completely lost a reason to exist. It's not 1990 anymore.

You sure know how to ruin a good thing with bad demeanor. When someone likes something and you say they shouldn't like it because it does exactly the same thing in a different way, you are actively driving people away from the better thing.

It's ok to like OpenMP.

What I disagree with is that it should be suggested to beginners as the way to parallelize their C++ programs.

That's like telling a JavaScript programmer that they should parallelize their programs by using Python or C.

Show them how to do it in JavaScript, or in this case, in C++, so that they don't have to learn a whole new programming model or language to just write parallel code.

Particularly when C++ has supported this for so long now.

OpenMP is a set of #pragma directives that sit directly in your C++ code.

> What I disagree with is that it should be suggested to beginners as the way to parallelize their C++ programs.

I guess we can agree to disagree then. If beginners think your way is easier, they're welcome to try. But there's plenty of production code examples that show the simplicity of OpenMP.

    #pragma omp target parallel for

The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.

OpenMP is just one tool in my toolbox. To be honest, I've found it to be not flexible enough for most of my usage, but thanks to its sheer simplicity it is, again, one of the easiest C / C++ tools I've ever used. Yes, even for playing or dabbling in GPGPU programming.

Furthermore, OpenMP is usable on GCC and Clang. It's even available (OpenMP 2.0 at least) on MSVC++ (though OMP 2.0 leaves much to be desired, that's still enough for some degree of programming on Windows). So OpenMP code in, say, Blender (3D raytracing program) runs on pretty much all important C++ platforms.

GPU and FPGA programming is complicated to do well, because GPUs and FPGAs sit behind a huge PCIe 3.0 bottleneck. A lot of code in CPU-land can stay in L1, L2, or L3 cache and finish faster than the PCIe transfer alone would take. In contrast, CPU-to-CPU transfers are very quick (they happen at L3-to-L3 or L2-to-L2 transfer speeds), so your "cost of communication" is very low. I don't want to discourage any beginner from playing with GPU code (especially if they're "just messing around"). GPU code is easier to write than most expect.

But it's surprisingly difficult to actually beat CPU code with GPU-offload code.

If it's not something that works out for you, that's fine I guess? There's a lot of different tools for a lot of different situations.

--------

A fun OMP thing btw, is...

    #pragma omp parallel for simd

Which (tries to) compile your program into SIMD code, like AVX512 or NEON for ARM.
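
A small sketch of that directive on a SAXPY-style loop follows. The function name is made up for this example. The directive is a request, not a guarantee: the compiler decides whether the loop is actually vectorized, and without OpenMP enabled the pragma is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): y[i] = alpha * x[i] + y[i].
void saxpy_omp_simd(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    // Threads split the iteration space; within each chunk the compiler is
    // asked to use vector instructions where it can.
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        yp[i] = alpha * xp[i] + yp[i];
    }
}
```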

OMP isn't as flexible as writing your own threads by hand, but it's easy to experiment with many forms of parallelism with the same code. There are also nifty clauses like "firstprivate", "collapse", or "reduction", as well as different schedulers (static, dynamic, guided).
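
As a rough illustration of those knobs, the sketch below combines a `reduction` clause with an explicit `schedule`. The function name is an assumption for this example. Changing `static` to `dynamic` or `guided` alters how iterations are handed to threads without touching the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): dot product of two vectors.
double dot_product_omp(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    double sum = 0.0;
    // reduction(+ : sum): every thread accumulates into a private copy of sum,
    // and the copies are added together when the loop finishes.
    // schedule(static): iterations are divided into fixed chunks up front.
    #pragma omp parallel for reduction(+ : sum) schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```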

Honestly, it's really good for prototyping. You write one for-loop, but have all these knobs and dials to try out a bunch of different strategies. But for "final code", hand-crafted threaded code really can't be beaten.

--------

BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.

> The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.

The target does not suffice, you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.
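
To make the data-movement point concrete, here is a hedged sketch of an offloaded loop with explicit `map` clauses. The function name is hypothetical. Whether the loop really runs on a device depends on the compiler having offloading support for that device; otherwise the region runs on the host, and with OpenMP disabled the pragma is ignored entirely.

```cpp
#include <cassert>

// Hypothetical helper (not from the thread): c[i] = a[i] + b[i] in a target region.
void add_arrays_target(const float* a, const float* b, float* c, int n) {
    assert(n >= 0);
    // map(to: ...) copies the inputs to the device before the loop;
    // map(from: ...) copies the result back to the host afterwards.
    #pragma omp target teams distribute parallel for map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The two `map` clauses are exactly the transfers that cross the bus on a discrete device, which is why they tend to dominate the cost for a loop this small.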

> BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.

They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.

> The target does not suffice, you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.

Yes. That's the bottleneck and difficulty of GPU / FPGA programming. Knowing where your memory is. PCIe is very high latency, especially compared to L1, L2, L3, or DDR4 RAM.

Look, even in NVidia's "Thrust" library, you want to very carefully be thinking about CPU vs GPU RAM. If your operations are primarily on the CPU-side, you want a CPU-malloc. If your operations are primarily on GPU-side, you want a GPU-malloc.

Modern PCIe can "handle the details" for you, but if your GPU memory accesses all go through the PCIe bus to read CPU-DDR4 RAM to do anything, it will be simply slower than using the CPU itself.

This isn't a beginner subject anymore, not in the slightest.

> They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.

I severely doubt that std::for_each_n exists on GPU code.

I see that for_each_n exists in NVidia's "Thrust" library, which is also a beginner-level API / library to use in the CUDA system (but not as efficient as dedicated GPU code). And I can imagine that NVidia Thrust might be compatible with more recent C++ standards.

But I cannot imagine the underlying API to know whether or not to do a CPU-malloc or GPU-malloc efficiently. And I'm not seeing any std::api that handles this detail. (NVidia Thrust has the programmer explicitly call whether you're using a device_vector vs host_vector).

-------

The __ONLY__ API that ever tried to "automagically" figure out the CPU-malloc vs GPU-malloc issue was C++ AMP by Microsoft. It was interesting, but performance issues and DirectX11 compatibility prevented progress (when DX12 GPUs came out, the C++AMP project didn't keep up).

I liked their array_view abstraction and its "automagic" at trying to figure out this memory-management issue. But... I really haven't seen anything like that since C++AMP.

> I severely doubt that std::for_each_n exists on GPU code.

https://docs.nvidia.com/hpc-sdk/compilers/c++-parallel-algor...

This is 4 years old. Been using it in production for the last 2 years. Works fine.

Pretty much everyone I've talked to using this in production from other research groups was able to remove all their CUDA code and replace it with this without any performance hit.

There are some recent publications about this, but most of them are quite old right now cause this is not new anymore: https://arxiv.org/abs/2010.11751

OpenMP code can be compiled as single threaded on compilers that don't support it without code changes. It's not a language but more like a set of annotations to be added.
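
A minimal sketch of that property, with made-up function names: the same source builds with or without OpenMP. The `_OPENMP` macro is defined by the compiler only when OpenMP is enabled, so it can guard the one header that would otherwise be missing.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>  // only included when the compiler has OpenMP enabled
#endif

// Hypothetical helper: number of threads a parallel region may use.
int max_worker_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;  // pragma ignored: everything runs on the calling thread
#endif
}

// Hypothetical helper: 1 + 2 + ... + n, correct in both build modes.
long long sum_to(int n) {
    assert(n >= 0);
    long long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```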

I was not aware of C++ having something similar. Is that a new feature?

Edit: YES it's C++17 and later: https://en.cppreference.com/w/cpp/algorithm/execution_policy...

> If you are using C++, and want to parallelize something, just add "std::execution::par" to your algorithms.

Do any of the shipping standard libraries actually implement execution policies? I only use gcc and clang so have to resort to TBB to get this capability.
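
One way to check what a given toolchain claims is the standard feature-test macro `__cpp_lib_parallel_algorithm`. The sketch below uses hypothetical function names and only reports what the library advertises; it says nothing about which backend is used or whether extra libraries such as TBB must be linked.

```cpp
#include <algorithm>  // the macro is specified to be available from <algorithm>
#if defined(__has_include)
#  if __has_include(<version>)
#    include <version>  // C++20 header that collects the feature-test macros
#  endif
#endif

// Hypothetical helper: value of the feature-test macro, or 0 if it is not defined.
constexpr long parallel_algorithm_macro() {
#ifdef __cpp_lib_parallel_algorithm
    return __cpp_lib_parallel_algorithm;
#else
    return 0;
#endif
}

// Hypothetical helper: true if the library advertises the C++17 parallel algorithms.
constexpr bool has_parallel_algorithms() {
    return parallel_algorithm_macro() >= 201603L;
}
```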

>> Do any of the shipping standard libraries actually implement execution policies? I only use gcc and clang so have to resort to TBB to get this capability.

Looks like it's a C++17 and C++20 feature:

https://en.cppreference.com/w/cpp/algorithm/execution_policy...