I wonder if this is also a PTX-level optimization that bypasses CUDA, like the one behind DeepSeek's 10x performance gain: https://xyzlabs.substack.com/p/deepseeks-latest-shocker-who-...