by dahart 2956 days ago
Great write-up! Your speedups look like they line up as expected. Would you happen to know why Aras's look a bit weird, is it the devices he's using? I'm referring to his Metal timings being more than 10x slower than D3D, and being almost comparable to C++.

> One of my goals was “use as much of the same code as possible on the CPU and GPU”.

Great idea, it's hard to debug GPU code, and super convenient to have something that you can run & debug on the CPU, then just flip a switch and get GPU speedups!

> Another obvious limitation is that GPU code cannot recurse.

Yep. I bet this changes relatively soon, since you can sometimes use recursion in OpenCL & CUDA.

It's a fun exercise in ShaderToy to write a recursive ray tracer. Here's mine: https://www.shadertoy.com/view/XllBRf

And here's a much better one: https://www.shadertoy.com/view/4scfz4

3 comments

> Great write-up! Your speedups look like they line up as expected. Would you happen to know why Aras's look a bit weird, is it the devices he's using? I'm referring to his Metal timings being more than 10x slower than D3D, and being almost comparable to C++.

Thanks for reading! Since Aras is using a GTX 1080 Ti on Windows and an Intel Iris Pro on Mac, I think the numbers make sense. In my case, my GTX 770 is pretty old and my 2017 MacBook is pretty new, so the numbers line up better.

> It's a fun exercise in ShaderToy to write a recursive ray tracer. Here's mine: https://www.shadertoy.com/view/XllBRf

> And here's a much better one: https://www.shadertoy.com/view/4scfz4

Nice! I should add accumulation to mine -- it'd look much better for barely any effort.

> I'm referring to his Metal timings being more than 10x slower than D3D

In my case it's the difference in hardware. The DX11 results are on a GTX 1080 Ti, whereas the Mac results are on an Iris Pro in a 2013 MacBook Pro. A >10x performance difference between these GPUs is entirely expected.

Totally makes sense; thanks for the explanation!

> Yep. I bet this changes relatively soon, since you can sometimes use recursion in OpenCL & CUDA.

Am I correct in assuming you can only recurse when the recursion can be "unrolled" into a simple loop? In other words, no program flow structures.

You can do real recursion sometimes, and real program flow. It does depend on CUDA version & chipset, but most modern ones support flow control. (https://stackoverflow.com/questions/3644809/does-cuda-suppor...)

I think there are still cases where it won’t work right due to other libraries that might be involved; I’ve had recursion fail even on a new GPU with CUDA 9.

Even if you can use recursion, it’s not usually a good idea since you’ll run into thread divergence problems.
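
For anyone wondering what the recursion-to-loop rewrite mentioned above looks like, here's a minimal sketch in plain C++. It's a toy model, not code from the article: `Surface`, `shade_recursive`, and `shade_iterative` are hypothetical names, and it assumes every bounce hits the same surface, which emits some light and reflects a fixed fraction of what comes back from the next bounce.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical toy surface: emits `emitted` light and reflects a fraction
// `albedo` of whatever arrives from the next bounce.
struct Surface {
    float emitted;
    float albedo;
};

// Recursive form: radiance = emitted + albedo * radiance(next bounce).
float shade_recursive(const Surface& s, int depth, int max_depth) {
    if (depth >= max_depth) return 0.0f;  // bounce limit reached
    return s.emitted + s.albedo * shade_recursive(s, depth + 1, max_depth);
}

// Iterative form: carry the product of albedos so far ("throughput")
// and add each bounce's emission scaled by it.
float shade_iterative(const Surface& s, int max_depth) {
    assert(max_depth >= 0);
    float radiance = 0.0f;
    float throughput = 1.0f;
    for (int depth = 0; depth < max_depth; ++depth) {
        radiance += throughput * s.emitted;
        throughput *= s.albedo;
    }
    return radiance;
}
```

This only works so directly because there's a single recursive call per bounce. If a hit spawns more than one ray (say, reflection plus refraction), you'd need an explicit stack or some way to pick one branch per bounce instead.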