This paper ported a CFD application, which had a tuned CUDA implementation, to std::par: https://arxiv.org/pdf/2010.11751.pdf .

In Table 3, the first and last columns show the performance of CUDA and std::par as a percentage of theoretical peak.

The rows show results for different GPU architectures.

On V100, CUDA achieves 62% of theoretical peak and std::par 58%.

Given how little developer effort it takes to get over 50% of theoretical peak with std::par, it's a no-brainer IMO.

If there is one kernel where you need more performance, you can always implement that kernel in CUDA, but for 99% of the kernels in your program your time might be better spent elsewhere.
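
To give a sense of what the effort looks like, here is a minimal sketch of a kernel written with the standard parallel algorithms. It is not taken from the paper: `saxpy` is a hypothetical example I made up, and the paper's CFD kernels are of course more involved. The point is only that the "kernel" is an ordinary C++ algorithm call with an execution policy in front.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Hypothetical example kernel (not from the paper): y[i] = a * x[i] + y[i].
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    // The execution policy is the only parallelism-specific part of the call.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(),  // first input range
                   y.begin(),           // second input range
                   y.begin(),           // output, written in place over y
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

As far as I know, compiling this with NVIDIA's nvc++ and `-stdpar` is what offloads it to the GPU; with GCC the same source runs on the CPU (libstdc++ typically needs TBB for the parallel policies). If this one kernel turned out to be the bottleneck, you could swap just this function for a hand-written CUDA version and leave the rest alone.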