by animatethrow 946 days ago
Does WebGPU enable a pure 2D game with many sprite animations to skip packing a texture atlas and still get the best performance? I.e. can I tell the GPU, "Draw these 1000 quads using the following distinct 1000 textures using just this one draw call. I'll be changing the 1000 textures each frame." My experience with OpenGL and D3D11 is that an atlas is the only way to do this. (I've found stb_rect_pack.h to be the least hassle route to packing the atlas.)

I started looking at D3D12 and saw that it had command recording, but it wasn't clear to me if this is any more efficient for a 2D game than just using D3D11/OpenGL to send 1000 separate pairs of commands to set-this-texture/draw-this-quad. With D3D12 the CPU is still performing thousands of function calls per frame to "record" these commands and I don't see how this is cheaper than having D3D11 do thousands of draw calls. D3D11 just puts a draw call into a command queue and immediately returns, so isn't this effectively kinda the same thing as using ID3D12GraphicsCommandList "command recording"? I never got around to benchmarking or learning more, so I'm sure I must be misunderstanding the advantages.

I've also noticed that despite D3D12 launching back in 2015, the #1 most used engine, Unity, is still defaulting to D3D11 and has struggled to make D3D12 as performant/stable. So it seems I'm not the only one who can't figure out how the newer APIs offer more performance.
2 comments

WebGPU currently doesn't support the "bindless" resource access model (see: https://github.com/gpuweb/gpuweb/issues/380).

The "max number of sampled textures per shader stage" (maxSampledTexturesPerShaderStage) is a runtime device limit, and the guaranteed minimum for that seems to be 16. So texture atlases are still a thing in WebGPU.
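To put a number on what that limit means for the 1000-quad case, here is a rough sketch (Python, just for the arithmetic) that splits an ordered list of quads into batches touching at most 16 distinct textures each. The function name `batch_by_texture_limit` is made up for illustration and is not part of any API; it also assumes draw order has to be preserved.

```python
def batch_by_texture_limit(quad_textures, limit=16):
    """Split an ordered list of per-quad texture ids into draw batches.

    Each batch references at most `limit` distinct textures, so it could be
    drawn with one set of bindings. Draw order is preserved.
    """
    batches = []
    current_quads = []
    current_textures = set()
    for tex in quad_textures:
        # A new texture that would exceed the limit closes the current batch.
        if tex not in current_textures and len(current_textures) == limit:
            batches.append(current_quads)
            current_quads = []
            current_textures = set()
        current_textures.add(tex)
        current_quads.append(tex)
    if current_quads:
        batches.append(current_quads)
    return batches


# 1000 quads, each with its own texture, at most 16 textures per batch.
batches = batch_by_texture_limit(list(range(1000)), limit=16)
print(len(batches))  # 63: 62 full batches plus one with the last 8 quads
```

So even in the best case that is 63 binding changes per frame instead of one, which is why the atlas still wins.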

WebGPU has render bundles, which allow pre-recording command sequences, but even with those you don't want to change resource bindings thousands of times per frame (or even hundreds of times).

It might make sense though to build texture atlases dynamically (basically use one very big texture as a "tile cache") and update that via writeTexture() calls (just don't rebuild the entire atlas each frame).
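A minimal sketch of the bookkeeping side of such a tile cache, in Python for brevity. It assumes all sprites fit fixed-size square tiles and uses least-recently-used eviction; both are my assumptions, and the class name `AtlasTileCache` is hypothetical. The actual pixel upload (e.g. a writeTexture() call targeting the returned offset) is left to the caller and not shown.

```python
from collections import OrderedDict


class AtlasTileCache:
    """LRU cache that maps sprite ids to fixed-size slots in one big atlas."""

    def __init__(self, atlas_size, tile_size):
        self.tile_size = tile_size
        per_row = atlas_size // tile_size
        # Free slots as pixel offsets (x, y) of each tile's top-left corner.
        self.free_slots = [
            (col * tile_size, row * tile_size)
            for row in range(per_row)
            for col in range(per_row)
        ]
        self.resident = OrderedDict()  # sprite_id -> (x, y), oldest first

    def acquire(self, sprite_id):
        """Return ((x, y), needs_upload) for the sprite's slot in the atlas."""
        if sprite_id in self.resident:
            self.resident.move_to_end(sprite_id)  # mark as most recently used
            return self.resident[sprite_id], False
        if self.free_slots:
            slot = self.free_slots.pop(0)
        else:
            # Atlas is full: evict the least recently used sprite, reuse its slot.
            _, slot = self.resident.popitem(last=False)
        self.resident[sprite_id] = slot
        return slot, True  # caller must upload the sprite's pixels to this slot
```

Only sprites that come back with needs_upload set cost an upload that frame; everything already resident is free. Note that this sketch will happily evict a sprite that was acquired earlier in the same frame if a frame needs more sprites than there are slots, so the atlas has to be sized for the worst-case frame.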

The new modern APIs are not to be understood as graphics APIs, rather as GPU APIs, thus using them directly is more akin to writing a graphics device driver than a rendering engine.

Most people are better served by using GL 4.6, DX 11 and such, or a middleware engine.

Even console vendors have multiple APIs because of that; not everyone needs all the little details of the GPU.

Unity is supposed to have good DirectX 12 support coming up, by the way.

"Achieving Real Time Ray Tracing on Xbox with Unity and DirectX12"

https://www.youtube.com/watch?v=giaEpbBGc6E

> The new modern APIs are not to be understood as graphics APIs, rather as GPU APIs

I've heard this repeated a lot in internet forums, but my experience working through hundreds of pages of Frank Luna's 800+ page DX12 book before concluding it pointless for 2D was that DX12 is actually fundamentally very similar to DX11 with most of the API focused on graphics rather than general compute. Compute shaders are just one chapter (13) of Luna's book, roughly 40 of the 800+ pages. I did some of the LunarG Vulkan tutorial and browsed the Khronos ref pages and reached a similar conclusion for Vulkan. I played with CUDA a bit and that's what real GPU programming looks like, almost no mention of graphics for much of the documentation. The "hello world" program isn't drawing a triangle, it's adding two arrays. Whereas a great deal of DX12, Vulkan, etc. is all about pixel formats, pixel shading, swap chains, geometry and tessellation, blending, depth and stencil, mipmaps and cube maps, clip coords, triangle winding and culling, the perspective Z divide, viewports, indexed and instanced draw calls, ... you know, graphics. But in chapter 9 of the DX12 book, end of 9.4, Luna writes, "Texture atlases can improve performance because it can lead to drawing more geometry with one draw call." So the conclusion I reached is that DX12 doesn't offer some fancy GPU compute way of writing a GPU program that can use 1000 distinct textures to draw 1000 distinct quads using only one CPU function call to launch this GPU program.

Now I've been doing more research and there is some sort of new feature called bindless textures, not covered in Luna's DX12 book, that might accomplish what I want (I'm not sure), but it seems to be Win11 only, WDDM 3.0 only, shader model 6.6 only, very new cards only. With this feature I might be able to set up 1000 distinct integer ids for my 1000 distinct textures, and then, with one single CPU draw call, have those 1000 textures applied to the correct 1000 quads, with no need to pack those 1000 textures into an atlas. Doing more web searching just now, this possibly can also be done in OpenGL on cards that support NV_gpu_shader5, but only semi-recent nVidia cards might support this. (I'm finding it difficult to get quick, quality answers to these sorts of questions using either web searches or LLMs.) Anyway, a gamedev forum or DX-focused reddit might be a better place for me to ask these sorts of technical questions.
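To make the "integer id per quad" idea concrete, here is a sketch of just the CPU side: one instance record per quad, each carrying a texture index, packed into a single buffer that one instanced draw call could consume. The record layout (four float32 values plus one uint32) and the name `build_instance_buffer` are assumptions for illustration, not anything from D3D12 or OpenGL; the shader that would use the index to pick a texture is not shown.

```python
import struct

# Hypothetical per-instance layout: x, y, width, height (float32)
# followed by a texture index (uint32), little-endian, no padding.
INSTANCE_FORMAT = "<4fI"
INSTANCE_SIZE = struct.calcsize(INSTANCE_FORMAT)  # 20 bytes per quad


def build_instance_buffer(quads):
    """Pack (x, y, w, h, texture_index) tuples into one contiguous buffer."""
    buf = bytearray(len(quads) * INSTANCE_SIZE)
    for i, (x, y, w, h, tex_index) in enumerate(quads):
        struct.pack_into(INSTANCE_FORMAT, buf, i * INSTANCE_SIZE, x, y, w, h, tex_index)
    return bytes(buf)


# 1000 quads on a 40-wide grid, quad i using texture i.
quads = [((i % 40) * 32.0, (i // 40) * 32.0, 32.0, 32.0, i) for i in range(1000)]
instance_data = build_instance_buffer(quads)
print(len(instance_data))  # 20000 bytes, uploaded once per frame
```

The per-frame CPU work is then one buffer upload and one draw call, regardless of how many distinct textures are referenced, assuming the GPU side really can index that many textures.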

If I understand correctly, what you are looking for are mesh shaders and shader work graphs, which allow one to basically do most of the compute stuff on the GPU without having the CPU steering anything, besides setting up the whole chain.

You will need DirectX 12 Ultimate or Vulkan for them.

https://developer.nvidia.com/blog/introduction-turing-mesh-s...

https://microsoft.github.io/DirectX-Specs/d3d/MeshShader.htm...

https://www.khronos.org/blog/mesh-shading-for-vulkan

https://devblogs.microsoft.com/directx/d3d12-work-graphs-pre...

https://gpuopen.com/learn/gpu-work-graphs/gpu-work-graphs-in...

https://gpuopen.com/gpu-work-graphs-in-vulkan/

So I read through the materials on mesh shaders and work graphs and looked at sample code. These won't really work (see below). As I implied previously, it's best to research/discuss these sorts of matters with professional graphics programmers who have experience actually using the technologies under consideration.

So for the sake of future web searchers who discover this thread: there are only two proven ways to efficiently draw thousands of unique textures of different sizes with a single draw call that are actually used by experienced graphics programmers in production code as of 2023.

Proven method #1: Pack these thousands of textures into a texture atlas.

Proven method #2: Use bindless resources, which is still fairly bleeding edge, and will require a fallback to atlases if targeting the PC instead of only high-end consoles (Xbox Series S|X...).

Mesh shaders by themselves won't work: These have similar texture access limitations to the old geometry/tessellation stages they improve upon. A limited, fixed number of textures still must be bound before each draw call (say, 16 or 32 textures, not 1000s), unless bindless resources are used. So mesh shaders must be used with an atlas or with bindless resources.

Work graphs by themselves won't work: This feature is bleeding edge shader model 6.8 whereas bindless resources are SM 6.6. (Xbox Series X|S might top out at SM 6.7, I can't find an authoritative answer.) It looks like work graphs might only work well on nVidia GPUs and won't work well on Intel GPUs anytime soon (but, again, I'm not knowledgeable enough to say this authoritatively). Furthermore, this feature may have a hard dependency on using bindless to begin with. That is, I can't tell if one is allowed to execute a work graph that binds and unbinds individual texture resources. And if one could do such a thing, it would certainly be slower than using bindless. The cost of bindless is paid "up front" when the textures are uploaded.

Some programmers use Texture2DArray/GL_TEXTURE_2D_ARRAY as an alternative to atlases but two limitations are (1) the max array length (e.g. GL_MAX_ARRAY_TEXTURE_LAYERS) might only be 256 (e.g. for OpenGL 3.0), (2) all textures must be the same size.

Finally, for the sake of any web searcher who lands on this thread in the years to come, to pack an atlas well a good packing algorithm is needed. It's harder to pack triangles than rectangles but triangles use atlas memory more efficiently and a good triangle packing will outperform the fancy new bindless rendering. Some open source starting points for packing:

https://github.com/nothings/stb/blob/master/stb_rect_pack.h

https://github.com/ands/trianglepacker
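
For anyone who wants to see the basic idea of rectangle packing before reaching for one of those libraries, here is a deliberately simple "shelf" packer sketch in Python. It is not the algorithm either library uses and it packs noticeably worse than a real packer; `shelf_pack` is a made-up name. It sorts rectangles tallest first and fills rows left to right.

```python
def shelf_pack(sizes, atlas_width):
    """Place rectangles left to right on horizontal shelves.

    sizes: list of (width, height). Returns (positions, atlas_height), where
    positions[i] is the (x, y) of the top-left corner of sizes[i].
    """
    # Sort tallest first so each shelf wastes less vertical space.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][1], reverse=True)
    positions = [None] * len(sizes)
    x = y = shelf_height = 0
    for i in order:
        w, h = sizes[i]
        if w > atlas_width:
            raise ValueError(f"rectangle {i} is wider than the atlas")
        if x + w > atlas_width:
            # Current shelf is full: start a new one below it.
            y += shelf_height
            x = shelf_height = 0
        positions[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return positions, y + shelf_height


positions, height = shelf_pack([(64, 64), (32, 32), (64, 32), (32, 16)], atlas_width=128)
print(positions, height)  # [(0, 0), (64, 0), (0, 64), (64, 64)] 96
```

Dividing each position and size by the atlas dimensions gives the UV rectangle for that sprite.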