by emanuele-em
74 days ago
The finding that naive single-op benchmarks overestimate dispatch cost by ~20x is wild. I'm curious how much the torch-webgpu backend could close the gap with CUDA if you went aggressive on kernel fusion; the 53% improvement on Vulkan is already significant. Any plans to try custom kernels written directly in WGSL?