by ffast-math 1458 days ago
No ML frameworks implement it yet, though I'd be happy to work with people from the PyTorch/TF/JAX/cuDNN/CUTLASS/etc. teams (or volunteers) if anyone wants to make this happen.

Also, while you can get 200x compression, I do want to emphasize that there's a speed vs quality tradeoff and the results will vary by problem. The paper has much more careful statements about the exact problem setup, tradeoffs, etc. And as I've mentioned in other comments, it probably won't help much on modern GPUs, since they accelerate dense GEMMs but not shuffles. CPU inference is the killer app here.