Great writeup. The three-pass softmax is where everyone
gets stuck — subtracting the max for numerical stability
is one of those things you can't learn from a textbook.
The pain points you hit (byte alignment, dispatch dims,
strict typing) make me wonder if there's a sweet spot
between raw WGSL and "import pytorch" that keeps you
close to the metal without all the papercuts.
The pain points you hit (byte alignment, dispatch dims, strict typing) make me wonder if there's a sweet spot between raw WGSL and "import pytorch" that keeps you close to the metal without all the papercuts.