by SaintSeiya 651 days ago
Honest question: it is claimed that a software rasterizer is faster than a hardware one. Can someone explain why? Isn't the purpose of the GPU to accelerate rasterization itself? Unless it's a recent algorithm, or the "software rasterizer" is actually running on the GPU and not the CPU, I don't see how.
6 comments

I'm a bit out of the GPU game, so this might be slightly wrong in some places: the issue is with small triangles, because you end up paying a huge cost for them. GPUs ALWAYS shade in 2x2 blocks of pixels, not 1x1 pixels.

So if you have a very small triangle (small as in how many pixels on the screen it covers) that covers 1 pixel, you will still pay the price of a 2x2 block (4 pixels instead of 1), so three quarters of the shading work is wasted (a 300% overhead).
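To put numbers on that, here is a rough sketch of the 2x2 cost. It assumes quads are aligned to even pixel coordinates and that every touched quad launches all 4 lanes; the function names are made up for illustration.

```python
def quad_shading_cost(covered):
    """Return (useful_lanes, launched_lanes) for a set of covered (x, y) pixels.

    Assumption: the hardware shades whole 2x2 quads aligned to even coordinates,
    so every quad touched by the triangle costs 4 shader lanes.
    """
    quads = {(x // 2, y // 2) for x, y in covered}
    return len(covered), 4 * len(quads)


def wasted_fraction(covered):
    """Fraction of launched lanes that do not correspond to a covered pixel."""
    useful, launched = quad_shading_cost(covered)
    return 0.0 if launched == 0 else 1.0 - useful / launched


# A 1-pixel triangle: 1 useful lane out of 4 launched.
one_pixel = {(3, 3)}
# A 2-pixel sliver that straddles a quad boundary: 2 useful lanes out of 8.
sliver = {(1, 0), (2, 0)}
# A triangle that happens to fill exactly one quad: nothing wasted.
full_quad = {(0, 0), (1, 0), (0, 1), (1, 1)}

for name, pixels in [("one_pixel", one_pixel), ("sliver", sliver), ("full_quad", full_quad)]:
    print(name, quad_shading_cost(pixels), wasted_fraction(pixels))
```

Note the sliver case: two pixels in different quads cost 8 lanes, so badly placed tiny triangles can be even worse than the 1-pixel case suggests.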

Nanite auto-picks the best triangle to minimize this and probably many more perf metrics that I have no idea about.

So even if you do it in software the point is that if you can get rid of that 2x2 block penalty as much as possible you could be faster than GPU doing 2x2 blocks in hardware since pixel shaders can be very expensive.

This issue gets worse the larger the rendering resolution is.

Nanite then picks larger triangles instead of those tiny 1-pixel ones since those are too small to give any visual fidelity anyway.

Nanite is also not used for large triangles since those are more efficient to do in hardware.

> So even if you do it in software the point is that if you can get rid of that 2x2 block penalty as much as possible you could be faster than GPU doing 2x2 blocks in hardware since pixel shaders can be very expensive.

Of course, the obvious problem with that is that if most of the screen isn't covered in such small triangles, you're paying a large cost for Nanite vs. traditional means.

Nanite has a heuristic to decide between pixel-sized compute shader rasterizing and fixed-function rasterizing. You can have screen-sized quads in Nanite and it's fine.
A couple of reasons:

1. HW always shades 2x2 blocks of pixels so it can compute derivatives, even if you don't use them.

2. Accessing SV_PrimitiveID is surprisingly slow on Nvidia/AMD; by writing it out in the PS you will take a huge perf hit in HW. There are ways to work around this, but they aren't trivial and differ between vendors, and you have to be aware of the issue in the first place! I think some of the "software" > "hardware" raster stuff may come from this.

The HW shader in this demo looks wonky though, it should be writing out the visibility buffer, and instead it is writing out a vec4 with color data, so of course that is going to hurt perf. Way too many varyings being passed down also.

In a high-triangle-count HW rasterizer you want the visibility buffer PS to do as little compute as possible and write as little as possible, so it should have only 1 or 2 input varyings and simply write them out.
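For a sense of how small that write can be, here is a hedged sketch of one common visibility buffer layout: a depth key and a triangle ID packed into a single 64-bit integer, so that keeping the maximum value per pixel also resolves visibility. The 32/32 bit split, the "larger depth key wins" convention, and all names below are assumptions for illustration, not something this demo is confirmed to do.

```python
DEPTH_BITS = 32  # assumed width of the depth key (high bits)
ID_BITS = 32     # assumed width of the triangle ID (low bits)
ID_MASK = (1 << ID_BITS) - 1


def pack(depth_key, triangle_id):
    """Pack the depth key into the high bits and the triangle ID into the low bits."""
    assert 0 <= depth_key < (1 << DEPTH_BITS)
    assert 0 <= triangle_id <= ID_MASK
    return (depth_key << ID_BITS) | triangle_id


def unpack(value):
    """Return (depth_key, triangle_id) from a packed value."""
    return value >> ID_BITS, value & ID_MASK


def write_fragment(vis_buffer, pixel, depth_key, triangle_id):
    """Emulate an atomic max on a dict: the payload with the largest depth key wins.

    Because the depth key sits in the high bits, comparing packed integers
    compares depth first; the ID only breaks ties.
    """
    packed = pack(depth_key, triangle_id)
    vis_buffer[pixel] = max(vis_buffer.get(pixel, 0), packed)


vis = {}
write_fragment(vis, (0, 0), depth_key=100, triangle_id=7)
write_fragment(vis, (0, 0), depth_key=50, triangle_id=9)   # loses the depth test
print(unpack(vis[(0, 0)]))
```

Shading then happens in a later pass that looks up the triangle by ID, which is why the rasterizing pass needs almost no varyings.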

What's PS? Pixel shader? I'm guessing here.
Yes, correct
The answer to that is in this hour-long SIGGRAPH video.[1] Some of the operations needed are not done well, or at all, by the GPU.

[1] https://www.youtube.com/watch?v=eviSykqSUUw

Here's the relevant part of the (really cool!) video: https://www.youtube.com/watch?v=eviSykqSUUw&t=1888s
I'm also curious. From what I could read in the repository's references, I think that the problem is that the GPU is bad at rasterizing small triangles. Apparently each triangle in the fixed-function pipeline generates a batch of pixels to render (16 in one of the slides I saw), so if the triangle covers only one or two pixels, all others in the batch are wasted. I speculate that the idea is to then detect these small triangles and draw them quickly using fewer pixel shaders (still on the GPU, but without using the graphics-specific fixed functions), but I'm honestly not sure I understand what's happening.
I thought it was a software rasterizer running inside a fragment shader on the GPU, not actually on the CPU. I need to watch that video again to be sure, but I can't see how a CPU could handle that many triangles.
To be precise, this is running in a compute shader (rasterizeSwPass.wgsl.ts for the curious). You can think of that as running the GPU in a mode where it's a type of computer with some frustrating limitations, but also the ability to efficiently run thousands of threads in parallel.

This is in contrast to hardware rasterization, where dedicated hardware on board the GPU decides which pixels are covered by a given triangle and assigns those pixels to a fragment shader, where the color (and potentially other things) are computed and finally written to the render target by a raster op (also a bit of specialized hardware).
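The "decide which pixels are covered" step is what a compute rasterizer has to do by hand. A minimal CPU-side sketch of the usual approach (edge functions evaluated at pixel centers inside the triangle's bounding box) might look like the following. It is not the code from rasterizeSwPass.wgsl.ts, it accepts either winding, and it ignores the top-left fill rule that real hardware applies to pixels exactly on an edge.

```python
import math


def edge(ax, ay, bx, by, px, py):
    """Edge function: positive when P lies on one side of A->B, negative on the other."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered_pixels(tri, width, height):
    """Return the set of (x, y) pixels whose centers fall inside the triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    if edge(ax, ay, bx, by, cx, cy) == 0:
        return set()  # degenerate (zero-area) triangle

    # Only visit pixels inside the bounding box, clamped to the render target.
    x0 = max(0, math.floor(min(ax, bx, cx)))
    x1 = min(width, math.ceil(max(ax, bx, cx)))
    y0 = max(0, math.floor(min(ay, by, cy)))
    y1 = min(height, math.ceil(max(ay, by, cy)))

    pixels = set()
    for y in range(y0, y1):
        for x in range(x0, x1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(bx, by, cx, cy, px, py)
            w1 = edge(cx, cy, ax, ay, px, py)
            w2 = edge(ax, ay, bx, by, px, py)
            inside_ccw = w0 >= 0 and w1 >= 0 and w2 >= 0
            inside_cw = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside_ccw or inside_cw:
                pixels.add((x, y))
    return pixels


tiny = [(3.2, 3.2), (3.9, 3.3), (3.4, 3.9)]
print(covered_pixels(tiny, 8, 8))
```

For a pixel-sized triangle the bounding box is a single pixel, so one thread can handle the whole triangle with a handful of multiplies, which is roughly why compute can compete here.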

The seminal paper on this is cudaraster [1], which implemented basic 3D rendering in CUDA (the CUDA of 13 years ago is roughly comparable in power to compute shaders today), and basically posed the question: how much does using the specialized rasterization hardware help, compared with just using compute? The answer is roughly 2x, though it depends a lot on the details.

And those details are important. One of the assumptions that hardware rasterization relies on for efficiency is that a triangle covers dozens of pixels. In Nanite, that assumption is not valid: a great many triangles are approximately a single pixel, and then software/compute approaches actually start beating the hardware.

Nanite, like this project, thus actually uses a hybrid approach: hardware rasterization for medium to large triangles, and compute for smaller ones. Both can share the same render target.
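As a rough illustration of that split, here is a sketch that routes each triangle by the size of its screen-space bounding box. The criterion, the 16-pixel threshold, and the function names are all assumptions chosen for the example; neither Nanite's nor this project's actual heuristic is described in this thread.

```python
def bbox_extent(tri):
    """Width and height of the triangle's screen-space bounding box, in pixels."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return max(xs) - min(xs), max(ys) - min(ys)


def choose_path(tri, max_sw_extent=16.0):
    """Pick 'software' (compute) for small triangles, 'hardware' otherwise.

    max_sw_extent is a hypothetical tuning knob, not a known engine constant.
    """
    w, h = bbox_extent(tri)
    return "software" if max(w, h) <= max_sw_extent else "hardware"


def split_by_path(tris, max_sw_extent=16.0):
    """Partition triangles into the two bins that feed the two rasterizers."""
    bins = {"software": [], "hardware": []}
    for tri in tris:
        bins[choose_path(tri, max_sw_extent)].append(tri)
    return bins


tiny = [(3.2, 3.2), (3.9, 3.3), (3.4, 3.9)]
huge = [(0.0, 0.0), (1920.0, 0.0), (0.0, 1080.0)]
bins = split_by_path([tiny, huge])
print(len(bins["software"]), len(bins["hardware"]))
```

Both bins would then write into the same render target, as described above.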

[1]: https://research.nvidia.com/publication/2011-08_high-perform...

Thanks all, yes, it's starting to make sense now.