Hacker News new | ask | show | jobs
by billconan 590 days ago
WebGPU doesn't seem to talk about bank conflict, hiding some hardware details that might be necessary to write the best kernel. will it be able to match the perf of Cuda on the same hardware?
2 comments

WebGPU cannot even come close unfortunately since they don't have support for hardware specific memory or warp-level primitives (like TMA or tensorcores). it's not like it gets 80% of perf, it gets < 30% of the peak perf for anything related to heavy compute matrix multiplications
> don't have support for hardware specific memory

I have no experience with WebGPU but if you mean group shared memory, I think the support is available. See the demo: https://compute.toys/view/25

i tried using workgroup shared memory and found it slower than just recomputing everything in each thread although i may have been doing something dumb

i'm excited to try subgroups though: https://developer.chrome.com/blog/new-in-webgpu-128#experime...

I've heard the WebGPU workgroup wants to close the gap on tensor core support.
you're definitely right, 80% was a bit of an overestimation, especially with respect to CUDA

it would be cool to see if there's some way to get better access to those lower-level primitives but would be surprised

it does seem like subgroup support are a step in the right direction though!

great question, to me webGPU sits a hair high level than CUDA or Vulkan. so you don't have the exact same level of control but can get to 80% performance of it without having to write different kernels specific to the hardware