Naive tree traversal on a GPU actually performs pretty badly because of execution divergence. It takes a lot of application-specific reframing of the problem to make working with BVH trees efficient: https://developer.nvidia.com/blog/thinking-parallel-part-ii-...
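It's not as complicated as it sounds. Stream compaction solves execution divergence. The end. Instead of recursively searching the tree, at each step you select the members of the tree with a child and pack them into a dense list, so every thread has real work to do on the next step.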
No, you can't do naïve recursion for this. GPUs just don't do that very well. But break it up with stream compaction, and everything is cake.
http://www.cse.chalmers.se/~uffe/streamcompaction.pdf
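Roughly what I mean, as a CPU-side Python sketch rather than real GPU code. Everything here is made up for illustration: the Node layout, the 1D "bounding boxes", and the names compact, traverse and NODES. On a GPU the flag, scan and scatter steps in compact would each be a parallel pass, with one thread per frontier entry.

```python
from itertools import accumulate
from typing import NamedTuple


class Node(NamedTuple):
    lo: float        # hypothetical 1D bounding interval [lo, hi]
    hi: float
    left: int = -1   # child indices; -1 marks a leaf
    right: int = -1


def compact(items, keep):
    """Stream compaction: flag, exclusive prefix sum, scatter."""
    flags = [1 if keep(x) else 0 for x in items]
    scan = list(accumulate(flags, initial=0))   # scan[i] = sum(flags[:i])
    out = [None] * scan[-1]                     # scan[-1] is the survivor count
    for i, x in enumerate(items):               # each "thread" writes to its own slot
        if flags[i]:
            out[scan[i]] = x
    return out


def traverse(nodes, q, root=0):
    """Level-by-level traversal with no recursion: returns leaves whose interval contains q."""
    frontier, hits = [root], []
    while frontier:
        # Drop nodes the query misses, so the remaining list is dense.
        frontier = compact(frontier, lambda n: nodes[n].lo <= q <= nodes[n].hi)
        # Surviving leaves are results.
        hits += compact(frontier, lambda n: nodes[n].left < 0)
        # Select the members with a child and expand them into the next frontier.
        inner = compact(frontier, lambda n: nodes[n].left >= 0)
        frontier = [c for n in inner for c in (nodes[n].left, nodes[n].right)]
    return hits


# Tiny hypothetical tree: node 0 is the root, nodes 2, 3 and 4 are leaves.
NODES = [
    Node(0, 10, 1, 2),
    Node(0, 5, 3, 4),
    Node(4, 10),
    Node(0, 2),
    Node(2, 5),
]

print(traverse(NODES, 4.5))
```

The point is that every pass does the same operation on a dense array, so there is nothing for the threads in a warp to diverge on.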
----------
It's not the memory-link latency that gets you here. It's branch divergence. Solve branch divergence, and then you're far faster than a CPU at traversing that BVH tree. Even without raytracing hardware. Even with lol 1000ns latency per node = node->next (GPUs turn out to be decent at latency hiding if you up the occupancy a bit... just double-check the compiler output / assembly to make sure the access was rearranged to a sane location).
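Back-of-envelope version of the latency-hiding argument, using Little's law (concurrency = latency x throughput). The function name and the throughput figures are placeholders I picked for illustration, not measurements of any real GPU.

```python
def loads_in_flight(latency_ns, loads_per_ns):
    """Little's law: how many independent loads must be outstanding
    to sustain a given completion rate at a given latency."""
    return latency_ns * loads_per_ns


# With 1000 ns per pointer chase, sustaining one node fetch per ns
# (a made-up target) needs about 1000 independent fetches in flight.
print(loads_in_flight(1000, 1.0))
```

That is why occupancy matters: each traversal still waits the full latency on its own pointer chase, but with enough independent traversals resident the memory system stays busy.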