Hacker News
by phire 761 days ago
Has it really been 9 years since I started working on Ubershaders?

I'm a little surprised no better solution has come along. Vulkan didn't even exist back then (and DirectX 12 had only just been released), but instead of making things better, the new APIs dug their feet even deeper into the assumption that all shaders will be known ahead of time (resulting in long "shader recompilation" dialogs on startup in many games).

I've been tempted to build my own fast shader compiler into Dolphin for many common GPU architectures. Hell, it wouldn't even be a proper compiler, more of a templated emitter as all shaders fit a pattern. Register allocation and scheduling could all be pre-calculated.

But that would be even more insane than ubershaders, as it would be one backend per GPU arch. And some drivers (like Nvidia's) don't provide a way to inject pre-compiled shader binaries.

On the positive side, ubershaders do solve the problem, and modern GPU drivers do a much better job at accepting ubershaders than they did 9 years ago. Though that's primarily because (as far as I'm aware) examples of Dolphin's ubershader have made their way into every single shader compiler test suite.

4 comments

>On the positive side, ubershaders do solve the problem, and modern GPU drivers do a much better job at accepting ubershaders than they did 9 years ago. Though that's primarily because (as far as I'm aware) examples of Dolphin's ubershader have made their way into every single shader compiler test suite.

How'd that come to be? Just interesting code for test suites or did you guys advocate for it to be included?

Don't you think an intermediate representation like SPIR-V would suffice in mostly eliminating stutter? Yuzu used that and shader stutter seemed to be minimal, and I can imagine that the shaders generated by Yuzu are much more complex than Dolphin's.

The only step that SPIR-V replaces is parsing the GLSL to an AST, and that's only a small part of the total time to compile a shader. Usually the bottleneck is register allocation or scheduling.

Back when Vulkan was developed, there were a bunch of OpenGL drivers out there which had random AST parsing bugs (Dolphin even has a bunch of workarounds for them), so a large chunk of the motivation for SPIR-V was avoiding the need for every driver to implement its own GLSL parser and the associated bugs.

The problem for Dolphin is not the complexity of the shader, but the quantity.

Shaders in modern games are usually written manually (or authored in a shader node editor by an artist), so it's rare for a game to have more than a few thousand total. Better games might only have a few dozen for the entire game.

But because GameCube/Wii games configure the TEV pixel pipeline through a dynamic API, some games use that API in a pattern where Dolphin can find itself generating hundreds of shaders per second. Some games even manage to generate new shaders continually as you play, because they append junk state to their pixel pipeline state which Dolphin doesn't detect as a duplicate.
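To make the "junk state" problem concrete, here is a minimal sketch of canonicalising pipeline state before it is used as a shader cache key. The bit layout (`MEANINGFUL_BITS`), the `PipelineKey` type and `make_key` are all invented for illustration; this is not Dolphin's actual state format or code.

```python
from dataclasses import dataclass

# Hypothetical layout: assume only the low 24 bits of each stage word
# influence rendering; the top 8 bits are "junk" the hardware ignores.
MEANINGFUL_BITS = 0x00FFFFFF


@dataclass(frozen=True)
class PipelineKey:
    stage_words: tuple  # one canonicalised integer per active stage


def make_key(raw_stage_words, active_stages):
    """Build a cache key that ignores state the hardware would not read."""
    used = raw_stage_words[:active_stages]  # drop words for disabled stages
    return PipelineKey(tuple(w & MEANINGFUL_BITS for w in used))


# Two raw states that differ only in junk bits and in a disabled third stage.
a = make_key([0x12000ABC, 0xFF000001, 0xDEADBEEF], active_stages=2)
b = make_key([0x34000ABC, 0x00000001, 0x00000000], active_stages=2)
```

Under these assumptions `a` and `b` compare equal, so a cache keyed on `PipelineKey` would reuse one shader instead of generating two.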

Shaving off the frontend costs is not going to be nothing. I don't know if Dolphin is still using FXC/D3DCompile or if they've switched to DXC, but FXC is infamously slow, even for very simple shaders. Dolphin's shaders are medium-complexity IIRC, so I'd expect removing the frontend to be a decent win.

The driver PSO compilers aren't amazing but they're also not terrible. Most games do some form of hash-n-cache for PSO compilation and while stutters are still an issue, it's not the worst in the world. With the frontend gone, I'd expect ~50 shaders per second to be roughly stutter-free.
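For readers unfamiliar with the pattern, "hash-n-cache" can be sketched roughly as follows. `PipelineCache` and the stand-in compile function are hypothetical names; a real engine would call into the graphics API where the lambda is.

```python
import hashlib


class PipelineCache:
    """Compile each (shader source, fixed state) pair at most once."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # stand-in for the slow driver compile
        self._compiled = {}
        self.misses = 0

    def get(self, source: str, state: bytes):
        # Hash everything that affects the compiled result.
        key = hashlib.sha256(source.encode() + b"\0" + state).hexdigest()
        if key not in self._compiled:
            self.misses += 1  # this is where a stutter would occur
            self._compiled[key] = self._compile(source, state)
        return self._compiled[key]


cache = PipelineCache(lambda src, st: ("compiled", len(src), st))
first = cache.get("void main() {}", b"\x01")
again = cache.get("void main() {}", b"\x01")  # served from the cache
other = cache.get("void main() {}", b"\x02")  # different state: new compile
```

The cache only helps on the second and later uses of a pipeline, which is why the first encounter with each new state can still stutter.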

Being smarter about specialization is probably a good idea -- having a blend between "GPU interpreter" and "full specialized pipeline" is where I think you should head. Several of the weirder TEV features could probably be moved to branching on dynamic buffer contents.

Not to mention using newer features like bindless to merge draw calls. I always wanted to do that but got too busy before I stopped working on Dolphin :)

> so I'd expect removing the frontend to be a decent win

I did some testing before working on ubershaders, and my modified build which cached the bytecode output of FXC/D3DCompile (whatever Dolphin was using at the time) didn't reduce the stuttering by enough to be worth the effort of optimising the frontend.

My conclusion was that it simply wasn't worth any effort to optimise for slightly smaller stutters, as they were still very perceptible to users. And Hybrid Ubershaders can hide any compile delays without any issue.

And this testing was with FXC/D3DCompile, which does a bunch of optimisations. The fact that SPIR-V comes in (potentially) unoptimised means any Vulkan compiler has to send it through all optimisation passes. Though I have been very tempted to do dead code removal before submitting the shaders, partly to make the shaders more readable to humans and partly to reduce the amount of code going through the various compiler passes.
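As a rough illustration of what dead code removal means here, the sketch below runs a backwards liveness pass over a tiny straight-line instruction list. The `(dest, op, sources)` tuple format and the instruction names are made up for this example and bear no relation to GLSL, SPIR-V or Dolphin's generator.

```python
def eliminate_dead_code(instructions, live_outputs):
    """Drop instructions whose results never reach a live output.

    instructions: list of (dest, op, sources) tuples in execution order.
    live_outputs: names that must survive (e.g. the final colour).
    """
    live = set(live_outputs)
    kept = []
    # Walk backwards so we know what is needed before we see its producer.
    for dest, op, sources in reversed(instructions):
        if dest in live:
            kept.append((dest, op, sources))
            live.discard(dest)    # defined here, so not needed earlier
            live.update(sources)  # but its inputs now are
    kept.reverse()
    return kept


program = [
    ("t0", "tex", ("uv",)),
    ("t1", "mul", ("t0", "k0")),
    ("t2", "add", ("t0", "k1")),  # never used by "out": dead
    ("out", "mov", ("t1",)),
]
optimised = eliminate_dead_code(program, live_outputs=["out"])
```

Here `t2` is dropped because nothing that reaches `out` reads it, while `t0` survives because `t1` still depends on it.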

> Being smarter about specialization is probably a good idea -- having a blend between "GPU interpreter" and "full specialized pipeline" is where I think you should head.

Yeah, that was always next on the list. Start with just ubershaders and then incrementally specialise on a background thread for the correct balance of shaders.
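A minimal sketch of that hybrid idea, assuming a single background worker: draw with the ubershader until the specialised variant for a given state has finished compiling, then swap it in. `HybridShaderManager`, `slow_compile` and the string "shaders" are placeholders for illustration, not Dolphin's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class HybridShaderManager:
    def __init__(self, compile_fn, workers=1):
        self._compile = compile_fn
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}  # pipeline state -> Future of specialised shader

    def shader_for(self, state):
        """Never blocks: returns the specialised shader only if it is ready."""
        future = self._pending.get(state)
        if future is None:
            # First sighting of this state: start compiling in the background.
            self._pending[state] = self._pool.submit(self._compile, state)
            return "ubershader"
        if future.done():
            return future.result()
        return "ubershader"

    def shutdown(self):
        self._pool.shutdown(wait=True)


def slow_compile(state):
    time.sleep(0.05)  # stand-in for a slow driver compile
    return f"specialised:{state}"


mgr = HybridShaderManager(slow_compile)
first_frame = mgr.shader_for("fog+1light")  # compile just started: fallback
mgr.shutdown()                              # demo only: wait for the worker
later_frame = mgr.shader_for("fog+1light")  # specialised shader is ready now
```

The render thread never waits on the compiler; the cost is running the slower ubershader for the frames in between.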

Dolphin's current specialised shaders are nowhere near fully specialised. We need to go further by baking some of the constants and lookup textures into the shader.

Very cool work!

I had to solve a similar problem years ago, during the transition from fixed function to shaders, when shaders weren't as fast or powerful as today. We started out with an ubershader approximating the DX9/OpenGL 1.2 fixed functions, but that was too slow.

People in those days thought of rendering state as being stored in a tree, like the transform hierarchy, and you ended up having unpredictable state at the leaf nodes, sometimes leading to a very high number of possible state permutations. At the time, I decomposed all possible pipeline state into atomic pieces, e.g. one light, fog function, texenv, etc. These were all annotated with inputs and outputs, and based on the state graph traversal, we'd generate a minimal shader for each particular material automatically, while giving old tools the semblance of being able to compose fixed function states. As in your case, doing this on demand resulted in stuttering, but a single game only has so many possible states: from what I've seen, it's on the order of a few hundred to a few thousand. Once all shaders are generated, you can cache the generated shaders and compile them all at startup time.
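The fragment-composition approach described above might look something like this in miniature. The fragment names, the shader-like snippets and `build_shader` are all invented for illustration; the original system is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    inputs: tuple   # values this piece reads
    outputs: tuple  # values this piece produces
    code: str


# Hypothetical library of atomic pipeline pieces.
LIBRARY = [
    Fragment("texenv", ("uv",), ("base",), "base = sample(tex0, uv);"),
    Fragment("light0", ("normal",), ("lit",), "lit = max(dot(normal, l0), 0.0);"),
    Fragment("combine", ("base", "lit"), ("color",), "color = base * lit;"),
    Fragment("fog", ("color", "depth"), ("fogged",), "fogged = mix(color, fogc, depth);"),
]

VERTEX_INPUTS = {"uv", "normal", "depth"}  # supplied from outside the shader


def build_shader(wanted, library=LIBRARY):
    """Emit only the fragments needed to produce `wanted`, in dependency order."""
    producers = {out: frag for frag in library for out in frag.outputs}
    ordered, seen = [], set()

    def visit(value):
        if value in VERTEX_INPUTS:
            return
        frag = producers[value]
        if frag.name in seen:
            return
        seen.add(frag.name)
        for dep in frag.inputs:
            visit(dep)         # emit producers of our inputs first
        ordered.append(frag)

    visit(wanted)
    return "\n".join(frag.code for frag in ordered)


no_fog = build_shader("color")     # fog fragment is left out entirely
with_fog = build_shader("fogged")  # same pieces plus the fog fragment
```

Each distinct set of enabled pieces yields its own minimal shader, which is where the permutation count comes from.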

I wonder if something like this would work for emulating a GameCube. You can definitely compute a signature for a game executable, and as you encounter new shaders, you can associate them with the game. Over time, you'll discover all the possible states, and if they're cached, you can compile all the cached shaders at startup.
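A rough sketch of that proposal, assuming states can be serialised as strings: hash the executable to get a per-game signature, and persist every state seen so the next launch can precompile them. `game_signature`, `SeenStates` and the JSON file layout are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def game_signature(executable_bytes: bytes) -> str:
    """Short, stable identifier derived from the game's executable."""
    return hashlib.sha256(executable_bytes).hexdigest()[:16]


class SeenStates:
    """Remembers, per game, every pipeline state encountered so far."""

    def __init__(self, cache_dir, signature):
        self._path = os.path.join(cache_dir, signature + ".json")
        self.states = set()
        if os.path.exists(self._path):
            with open(self._path) as fh:
                self.states = set(json.load(fh))

    def record(self, state: str):
        if state not in self.states:
            self.states.add(state)
            with open(self._path, "w") as fh:
                json.dump(sorted(self.states), fh)


cache_dir = tempfile.mkdtemp()
sig = game_signature(b"fake game executable")

session1 = SeenStates(cache_dir, sig)
session1.record("fog+1light")

# Next launch: everything in session2.states could be compiled at startup.
session2 = SeenStates(cache_dir, sig)
```

This only covers states a player has already triggered, so anything generated on the fly in a new area would still miss the cache.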

Anyhow, fun stuff. I used to love work like this. I implemented 3DFx's Glide API on top of DX ages ago to play Voodoo games on my Nvidia cards, and contributed some code to an N64 emulator named UltraHLE.

> contributed some code to an N64 emulator named UltraHLE

That's a blast from the past. I distinctly remember reading up on UltraHLE way back when, then trying it out and for the first time being able to play Ocarina of Time on my middle-class PC with almost no issues. That was magical.

I still don’t understand why you didn’t use the precompiled shaders packed with the games… you’re emulating the GameCube or Wii GPU, and it’s never going to change, and the games provide precompiled shaders.

First, the GameCube/Wii API actually generates the "shaders" at runtime, so there is simply no way to know which vertex/pixel pipeline states the game needs short of playing through the whole game, looking at every single bit of level geometry.

Many games actually dynamically generate new "shaders" on the fly, based on which lights are near an object, and in which order.

Second, we can't use those vertex/pixel pipeline states directly on a modern GPU; they need to be translated into modern shaders, and then compiled by the driver for your graphics card. It's actually that compile step which causes the stuttering; Dolphin's translation is plenty fast.

The combination of these two facts means Dolphin can't depend on any pre-computation at all.

I don’t get it (this is not your fault, it’s mine) but I believe you.

If I understand correctly:

1. "Shader" is just a metaphor; the actual code running on the GameCube GPU is a custom pipeline that has a dynamic structure and is updated aggressively throughout the lifetime of the app. There is no static "shader" program to run on the host GPU.

2. The architectures of the GameCube and modern GPUs are so distinct as to require an intricate translation layer in order to map GameCube rendering operations to first-class shader operations on a modern GPU. This very process causes the stuttering that started the whole issue.

Translation is not intricate, but modern graphics cards are not tuned for dynamically setting up shaders.

That's the trick: they actually don't provide precompiled shaders as you know them. The graphics hardware back then consisted of fixed function pipelines with a tremendous number of options to configure how they work. The downside is that you can't run truly arbitrary code, but the upside is that they can switch behavior as fast as you can set a register.

Prior to ubershaders the emulator took a configuration for the hardware pipeline and turned that into a shader, which took time to compile. Ubershaders work by emulating the entire fixed function pipeline in one glorious shader until the smaller, more efficient shader can be compiled and slipped in.
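A toy model of that difference, with a three-operation "pipeline" standing in for the real fixed function hardware: `uber_interpret` is one fixed routine that branches on the configuration at run time (no compile step), while `specialise` generates and compiles straight-line code for a single configuration. Everything here is a hypothetical illustration in Python, not actual shader code.

```python
OPS = {"add": "+", "mul": "*", "sub": "-"}


def uber_interpret(stages, color):
    """Ubershader-style: works for any configuration immediately."""
    for op, operand in stages:
        if op == "add":
            color = color + operand
        elif op == "mul":
            color = color * operand
        elif op == "sub":
            color = color - operand
        else:
            raise ValueError(f"unknown op: {op}")
    return color


def specialise(stages):
    """Specialised-shader-style: pay a compile cost, get branch-free code."""
    body = "color"
    for op, operand in stages:
        body = f"({body} {OPS[op]} {operand!r})"
    namespace = {}
    # exec() stands in for the expensive driver compile.
    exec(f"def shader(color):\n    return {body}", namespace)
    return namespace["shader"]


config = [("mul", 0.5), ("add", 0.25), ("sub", 0.125)]
fast = specialise(config)

interpreted = uber_interpret(config, 1.0)
compiled = fast(1.0)
```

Both paths compute the same value for the same configuration; the interpreter is available instantly, and the specialised version is cheaper per call once it exists.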

Basically, the ubershader is the only thing that can actually understand the "shaders" packaged with the game and start using them with zero latency.

Why not just precompile all the possible hardware combinations? There are far more combinations than atoms in the universe. Why not just precompile all the hardware combinations that the game actually uses? There's no way to tell beforehand without examining every branch of the game's code, which ranges in difficulty from "computationally prohibitive" to "fundamental theorems of how computers work say this is impossible".
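A quick back-of-the-envelope check of the "atoms in the universe" comparison, using the commonly quoted rough estimate of 10^80 atoms in the observable universe. The 16 stages of 32 bits each are an assumed example size, not a measured figure for the real hardware state.

```python
ATOMS_ESTIMATE = 10 ** 80  # commonly quoted order-of-magnitude estimate

# Number of independent on/off configuration bits at which the number of
# combinations (2 ** bits) first exceeds the atom estimate.
bits_to_exceed_atoms = ATOMS_ESTIMATE.bit_length()

# Hypothetical pipeline: 16 stages with 32 configuration bits each.
hypothetical_bits = 16 * 32
hypothetical_combinations = 2 ** hypothetical_bits

exceeds = hypothetical_combinations > ATOMS_ESTIMATE
```

Any pipeline with a few hundred independent configuration bits already crosses that threshold, so enumerating every combination ahead of time is out of the question.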

The article mentions that some users actually passed around cached shader packs, but that solution was brittle.

Wait, I thought that's what the ubershaders are. What you say is what I kept thinking for much of the article: "just" emulate the GPU, no compiler needed. And then they did.

Maybe, lol, there were a lot of terms in that article that I didn’t understand well enough to claim I understood the article in toto.

One thing to remember is these older consoles don't have the same concept of a "shader" as we do today.

Go back far enough and you'll find the industry trying to settle on quads or triangles for rendering (and we all know who won).

The games were given basically an immediate mode API into the graphics card and they could do whatever they wanted, whenever they wanted, without warning.

The stutter happened when they were translating the API mentioned above into modern GPU shaders.

When it was on the CPU, they had to determine the effect, generate and compile the modern shaders, and upload them to the GPU, sometimes hundreds of times a second. Then the GPU would take over and display.

Uber shaders took that entire pipeline and moved it into the GPU.

This was low level emulation, just still hitting limits of modern CPUs.

PC games that have a shader pre-compile step usually have to redo it when new drivers come out. Pre-compiled shaders can be shipped for closed systems such as consoles or even the Steam Deck, but not for PC: each GPU brand requires different ones, and like I said, so does each driver update.

They're precompiled for the console GPU architecture, not the PC architecture, so they can't be used directly and still need to be emulated. I think those precompiled shaders are the input to the ubershader.

The GAMES THEMSELVES are precompiled for the PowerPC architecture, not the PC architecture, though. That didn’t stop anyone from creating Dolphin.

GPUs (I’m told) have far fewer instructions to emulate than a CPU, so I’d think that low level emulation of the Flipper shaders would be no trouble. Can’t translate or transpile them to PC GPUs though because those instruction sets are somewhat secret, I think.

I know nothing about this stuff but I am a developer so perhaps I know enough to ask the most stupid questions possible.

It’s gotta be a performance thing, why they didn’t emulate Flipper at a low enough level to use the precompiled shaders directly.

> because those instruction sets are somewhat secret, I think

The GPU ISAs are known (e.g. the PTX compiler for NVidia is open source and has a backend in LLVM). The main problem is that the GPU ISA changes with every GPU hardware generation and manufacturer, so if you want to support Nvidia 3xxx + 4xxx + AMD VLIW + AMD GCN + ... you have to use the common denominator GLSL/HLSL/SPIR-V/whatever.

> why they didn’t emulate Flipper at a low enough level to use the precompiled shaders directly.

They did. Originally the GPU emulator was done in the CPU, and in 2017, the GPU emulator itself was moved into a shader ("ubershader").

The console game itself does not include shaders in text format like many PC games do.

> The GPU ISAs are known (e.g. the PTX compiler for NVidia is open source and has a backend in LLVM)

PTX is only an IR afaik, kinda like SPIR-V. It also goes through another compiler in the driver, so it doesn't really help here.

The ubershader is the thing that emulates Flipper at a low enough level to use the precompiled "shaders" directly. Prior to that the precompiled "shaders" were examined and recompiled into individual shaders, a process that took time.

(Why "shaders" in quotes? Because they weren't shaders as we know them today but really more like lists of hardware flags for how to flow data through a fixed function pipeline)

Yes, that’s exactly the point though. This is the same question as why you can’t emulate a game by precompiling its code, and this doesn’t work because that information isn’t available until you try to run the game. That’s why Dolphin has an interpreter/JIT.

> This is the same question as why you can’t emulate a game by precompiling its code, and this doesn’t work because that information isn’t available until you try to run the game.

I mean technically you can, but it generally requires a bunch of inefficient jump tables, or alternatively a way to fall back to an interpreter or JIT for self modifying code.