Hacker News
by CoolGuySteve 2830 days ago
Why are RT cores so different from normal shader cores? What instructions or memory fetches does a ray trace operation need that couldn't be implemented as an added instruction set on the shader cores to navigate the volume tree?

From the article, the best I can see is the following, but can't that be solved with microcode or as an extra rendering pipeline stage?

> In comparison, traversing the BVH in shaders would require thousands of instruction slots per ray cast, all for testing against bounding box intersections in the BVH

I ask because having more, slightly larger general-purpose cores seems better for both traditional rendering and raytracing than dedicating all that die space to single-purpose RT cores.

2 comments

RT cores are different because raytracing wants AoS (array of structures) rather than SoA (structure of arrays).

Let's look at the ALU perspective. Normal shader cores are essentially SoA: every ALU operation works on 32 (NVIDIA) or 64 (AMD) items / threads at a time.

Implementing a ray-box intersection requires 6 multiply-adds to determine the intersection time of the ray with each box plane, plus a bunch of comparisons to determine whether and when you hit the box. So if you're walking a standard (binary) BVH, each node has two child boxes and you need 12+x ALU instructions (roughly equal to cycles) to handle one step of a wave / warp, where x covers the comparisons.
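To make the instruction count concrete, here is a minimal sketch of the usual slab test for one ray against one axis-aligned box. The function name `ray_box_entry` and its argument layout are made up for illustration, and it assumes every direction component is non-zero so the reciprocal is finite.

```python
def ray_box_entry(origin, direction, box_min, box_max):
    """Slab test: return (hit, t_entry) for a ray against an axis-aligned box.

    Assumption: every direction component is non-zero, so 1/d is finite.
    """
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        inv_d = 1.0 / direction[axis]
        bias = -origin[axis] * inv_d  # depends only on the ray, reusable per box
        # One multiply-add per box plane: 2 planes x 3 axes = 6 per box.
        t0 = box_min[axis] * inv_d + bias
        t1 = box_max[axis] * inv_d + bias
        # The comparisons: shrink the interval in which the ray is inside the box.
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    return t_near <= t_far, t_near
```

For a ray from the origin along (1, 1, 1), the box spanning (1, 1, 1) to (2, 2, 2) is hit at t = 1.0. A binary BVH node runs this twice, once per child box, which is where the 12 multiply-adds come from.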

The picture is still fairly rosy when you start out your BVH walk, but then you get ray divergence. Some of your rays may finish early, some rays may want to do ray-triangle intersection instead. This means that only some of the SIMD lanes will be active and your ALU utilization drops. You're using the same number of cycles, but get much lower bang / buck.

In a dedicated RT core, you can operate one ray at a time instead of one instruction at a time. So you can do all multiply-adds for intersecting a single ray with both boxes in your BVH node in a single cycle, and then follow up with the comparisons in the remainder of your pipeline.

The upshot is that when rays diverge, you can still fully utilize the ALUs in your ray-box intersection pipeline -- it simply takes you fewer cycles to process all rays in a warp.
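A toy model of the utilization argument, as a sketch only: it assumes each ray needs some number of box-test steps, that a lockstep warp keeps every lane occupied until its slowest ray is done, and that a per-ray unit does no idle work. The function names are hypothetical, and the model compares utilization only, not absolute speed.

```python
def simd_utilization(steps_per_ray):
    """Fraction of lane-slots doing useful work when the warp runs in lockstep."""
    lanes = len(steps_per_ray)
    issued = lanes * max(steps_per_ray)  # every lane is held until the slowest ray ends
    return sum(steps_per_ray) / issued


def per_ray_unit_steps(steps_per_ray):
    """One-ray-at-a-time unit: total work is just the sum, with no idle lanes."""
    return sum(steps_per_ray)


# 8 rays walk deep into the tree, 24 finish early.
warp = [10] * 8 + [2] * 24
print(simd_utilization(warp))    # 0.4
print(per_ray_unit_steps(warp))  # 128
```

In this made-up warp the lockstep model spends 320 lane-slots on 128 slots of useful work, while the per-ray unit only ever issues the 128.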

A similar argument applies to the memory system as well -- due to ray divergence, you obviously want to store your BVH nodes as AoS. A BVH node requires 12 floats to store the dimensions of two boxes, plus some space for child node links, which makes 64 bytes a natural node structure size, and you want to keep it contiguously in memory so that loading one node means loading (part of) one cacheline. But this makes it difficult to get the data through a normal shader core's load unit, which is optimized for SoA.
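One possible 64-byte AoS layout, sketched with `ctypes` so the size can be checked. Only the 12 box floats come from the comment above; the split of the remaining 16 bytes into two child indices and two spare words is an assumption, as are the field names.

```python
import ctypes


class BVHNode(ctypes.Structure):
    """Hypothetical binary BVH node: two child boxes plus child links."""
    _fields_ = [
        ("boxes", ctypes.c_float * 12),   # 2 boxes x (min xyz, max xyz) = 48 bytes
        ("child", ctypes.c_uint32 * 2),   # indices of the two children = 8 bytes
        ("spare", ctypes.c_uint32 * 2),   # flags / padding up to 64 bytes (assumed)
    ]


# AoS: whole nodes sit back to back, so one node is one contiguous 64-byte read.
nodes = (BVHNode * 4)()
stride = ctypes.addressof(nodes[1]) - ctypes.addressof(nodes[0])
print(ctypes.sizeof(BVHNode), stride)  # 64 64
```

An SoA version would instead keep 12 separate float arrays, so fetching one node for one ray would touch many different cache lines.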

An RT core is more like a texture unit than a shader core. Tree traversal is a pointer-chasing problem, where the CPU/shader core executes a few instructions, then starts a memory load and then sits idle for tens or hundreds of clock cycles waiting for memory. Cache prefetching can help but is usually not a good fit for tree traversal, where there is very little computation per node.

It is all about memory latency hiding and not really about computation.
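A small sketch of what that pointer chasing looks like in a stack-based BVH walk. The `Node` layout, the `traverse` function and the fetch counter are all illustrative assumptions, and ray directions are assumed to have no zero components. The point is that every loop iteration begins with a load whose address came out of the previous node, followed by only a handful of arithmetic.

```python
from typing import NamedTuple


class Node(NamedTuple):
    box_min: tuple
    box_max: tuple
    left: int    # index of left child, -1 for a leaf
    right: int   # index of right child, -1 for a leaf
    prim: int    # primitive id for a leaf, -1 otherwise


def hits_box(origin, inv_dir, box_min, box_max):
    t_near, t_far = 0.0, float("inf")
    for a in range(3):
        t0 = (box_min[a] - origin[a]) * inv_dir[a]
        t1 = (box_max[a] - origin[a]) * inv_dir[a]
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    return t_near <= t_far


def traverse(nodes, origin, direction):
    """Return (leaf primitive ids whose boxes the ray hits, node fetch count)."""
    inv_dir = tuple(1.0 / d for d in direction)
    stack, hits, fetches = [0], [], 0
    while stack:
        node = nodes[stack.pop()]  # dependent load: the index came from a parent node
        fetches += 1
        if not hits_box(origin, inv_dir, node.box_min, node.box_max):
            continue
        if node.left < 0:
            hits.append(node.prim)
        else:
            stack.append(node.right)
            stack.append(node.left)
    return hits, fetches


tree = [
    Node((0.0, 0.0, 0.0), (4.0, 4.0, 4.0), 1, 2, -1),
    Node((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), -1, -1, 0),
    Node((3.0, 3.0, 3.0), (4.0, 4.0, 4.0), -1, -1, 1),
]
print(traverse(tree, (-1.0, 0.5, 0.5), (1.0, 0.01, 0.01)))  # ([0], 3)
```

None of the three fetches can be issued before the previous one returns, which is why the cost is dominated by latency rather than by the box tests.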

But GPU cores are already king at latency hiding. They can run hundreds of threads doing pointer chasing, switching between them round-robin as the memory reads complete.

The switching isn't free. Waking up a thread to do just a few computation cycles (a few ray-aabb intersections) and then going back to sleep while waiting for the next node to be fetched from the memory is super inefficient.

If there were significant computation needed per node, this wouldn't be an issue.
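A back-of-the-envelope model of the latency-hiding side of this, as a sketch: it assumes each warp alternates a fixed number of compute cycles with a fixed memory latency, and that switching between warps costs nothing, so it is the optimistic case. The function names and the cycle counts in the example are made up.

```python
def warps_to_hide_latency(compute_cycles, latency_cycles):
    """Resident warps needed so some warp always has work while the others wait."""
    total = compute_cycles + latency_cycles
    return -(-total // compute_cycles)  # ceiling division


def alu_utilization(resident_warps, compute_cycles, latency_cycles):
    """Fraction of cycles the ALUs are busy, assuming zero-cost switching."""
    busy = resident_warps * compute_cycles
    return min(1.0, busy / (compute_cycles + latency_cycles))


print(warps_to_hide_latency(4, 400))  # 101
```

With only 4 cycles of work per node against a 400-cycle fetch, this model needs about a hundred warps in flight per scheduler; with 40 cycles of work per node it would need 11.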

> The switching isn't free.

It absolutely is, on current GPUs. Think of it like a larger-scale version of SMT (Intel's hyperthreading). GPUs are able to do this because they execute instructions in-order and do not need to track thousands of instructions per thread.

It's more complex than that. Switching warps thrashes your caches. There is definitely a cost associated with it.

Well, yeah. If you are memory bandwidth-constrained it's a bad idea to go off-chip.

But for ray-tracing, what does it really matter? We are already assuming that you will wait a full memory fetch cycle to get the next node's child AABBs and child indices. The warps will do their intersection test on the data they just read and fire off the next read. Each thread's hot context should fit in under a cache line, since it's basically just a single ray to keep track of.
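A rough size check for that per-thread context, as a sketch. The field set is an assumption (ray origin, reciprocal direction, t-range, current node index, closest hit so far), and it leaves out any traversal stack, which would add to the total. The 64-byte budget matches the node size discussed earlier in the thread; real cache line sizes vary by GPU.

```python
import ctypes


class RayContext(ctypes.Structure):
    """Hypothetical hot state for one ray during a BVH walk (no stack)."""
    _fields_ = [
        ("origin", ctypes.c_float * 3),    # 12 bytes
        ("inv_dir", ctypes.c_float * 3),   # 12 bytes
        ("t_min", ctypes.c_float),         # 4 bytes
        ("t_max", ctypes.c_float),         # 4 bytes
        ("node", ctypes.c_uint32),         # index of the node being fetched
        ("best_prim", ctypes.c_uint32),    # closest primitive hit so far
    ]


LINE_BUDGET = 64  # assumed budget in bytes
print(ctypes.sizeof(RayContext), ctypes.sizeof(RayContext) <= LINE_BUDGET)  # 40 True
```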

IIRC it costs you a cycle to switch warps on Maxwell, but I'm not completely sure.