WebGPU doesn't seem to talk about bank conflict, hiding some hardware details that might be necessary to write the best kernel. will it be able to match the perf of Cuda on the same hardware?
WebGPU cannot even come close unfortunately since they don't have support for hardware specific memory or warp-level primitives (like TMA or tensorcores). it's not like it gets 80% of perf, it gets < 30% of the peak perf for anything related to heavy compute matrix multiplications
i tried using workgroup shared memory and found it slower than just recomputing everything in each thread although i may have been doing something dumb
great question, to me webGPU sits a hair high level than CUDA or Vulkan. so you don't have the exact same level of control but can get to 80% performance of it without having to write different kernels specific to the hardware