by obl
1202 days ago
In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient. I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer for modern GPUs. I'd guess something opportunistic (maybe per tile) that takes advantage of the general spatial coherency of triangles in mesh order.
I am no GPU expert, but I performed some experiments a while ago indicating that this is in fact how it works (all lanes of a warp coming from the same triangle), at least on NVIDIA.
I would expect it simplifies the fragment processing pipeline to have all the interpolants come from the same triangle. Another factor that comes to mind is that, due to the 2x2 quad padding, you would end up with multiple shader executions corresponding to the same pixel location, coming from different triangles; that would probably involve complicated bookkeeping, especially with MSAA.
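To make the two hypotheses in this thread concrete, here is a toy model of lane utilization. It is only a sketch under stated assumptions, not a description of any real GPU: the names `quads_touched` and `lane_utilization` are made up for illustration, shading is assumed to launch whole 2x2 quads, and a warp is assumed to hold `warp_size // 4` quads. The `pack_across_triangles` flag switches between "every warp is filled from a single triangle" and "quads from different triangles may share a warp".

```python
from math import ceil

def quads_touched(pixels):
    """Return the set of 2x2 quads (by quad coordinate) touched by the covered pixels."""
    return {(x // 2, y // 2) for x, y in pixels}

def lane_utilization(triangles, warp_size=32, pack_across_triangles=False):
    """Fraction of launched lanes that shade a covered pixel.

    triangles: list of sets of covered (x, y) pixels, one set per triangle.
    Hypothetical model: whole quads are launched, and a warp holds
    warp_size // 4 quads.
    """
    quads_per_warp = warp_size // 4
    covered = sum(len(t) for t in triangles)
    quad_counts = [len(quads_touched(t)) for t in triangles]
    if pack_across_triangles:
        # Quads from all triangles are packed together into warps.
        warps = ceil(sum(quad_counts) / quads_per_warp)
    else:
        # Each triangle gets its own warps; leftover lanes stay idle.
        warps = sum(ceil(q / quads_per_warp) for q in quad_counts)
    return covered / (warps * warp_size)

# Eight tiny triangles, each covering a single pixel in its own quad.
tiny = [{(4 * i, 0)} for i in range(8)]
print(lane_utilization(tiny, pack_across_triangles=False))  # 8 / (8 * 32) = 0.03125
print(lane_utilization(tiny, pack_across_triangles=True))   # 8 / (1 * 32) = 0.25
```

In this model, pixel-sized triangles cost 75% of the lanes to quad padding even with perfect packing, and the per-triangle restriction makes it eight times worse on top of that. Which of the two cases real hardware is closer to is exactly what is being debated here.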