| Hey! Great question! That's what I'm confused about as well! So on GPUs the goal is to saturate the GPU with matrix multiplies instead of data movement. I'll write a more detailed blog post, but approximately: 1. Flash Attention v2 reduces the time taken by 17% or so 2. RoPE Triton kernels: -7.1% 3. RMS LayerNorm in Triton: -3.1% 4. Cross Entropy in Triton: -1% 5. Manual autograd for MLP: -4% 6. Manual QKV autograd: -2% 7. Manual O autograd: -2% 8. Smart cache evictions, reduced data duplication, etc.: -30% 9. And other tricks in the Max and Pro versions make it 30x faster. You can see it's just tricks in each step, which accumulate to make it go faster. I'll write up a blog post to detail it all in the future!!! |
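For anyone wanting to sanity-check how the per-step numbers in that list add up, here is a rough sketch in Python. The quoted comment does not say whether each percentage is measured against the original baseline or against the time left after the earlier steps, so both readings are computed, and both are assumptions. Item 9 is left out because no per-step figure is given for it. The dictionary keys are just labels copied from the list.

```python
from math import prod

# Fractional time reductions as quoted in items 1-8 of the list.
# Item 9 ("other tricks") has no per-step number, so it is not included.
reductions = {
    "Flash Attention v2": 0.17,
    "RoPE Triton kernels": 0.071,
    "RMS LayerNorm in Triton": 0.031,
    "Cross Entropy in Triton": 0.01,
    "Manual autograd for MLP": 0.04,
    "Manual QKV autograd": 0.02,
    "Manual O autograd": 0.02,
    "Cache evictions / less duplication": 0.30,
}

# Assumption A: each reduction applies to the time remaining after the
# previous steps, so the remaining fractions multiply.
remaining_compound = prod(1.0 - r for r in reductions.values())

# Assumption B: each reduction is a share of the original baseline time,
# so the reductions simply add up.
remaining_additive = 1.0 - sum(reductions.values())

for label, remaining in [("compounding", remaining_compound),
                         ("additive", remaining_additive)]:
    speedup = 1.0 / remaining
    print(f"{label:>11}: {remaining:.3f} of baseline time, {speedup:.2f}x speedup")
```

Under either reading, items 1 to 8 work out to somewhere between roughly 2x and 3x, so whatever accounts for the rest of a 30x figure would have to come from the unspecified tricks in item 9.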
This feels like the collecting underpants meme. Phase 1: Get to the same performance as other methods. Phase 2: ???. Phase 3: Now you're at 750%!
You may or may not actually have succeeded at what you claim, but you're not being very persuasive. I realize that you're trying to turn these tricks into a profit and that revealing them would destroy that possibility. Still, you're going to have a really hard time persuading people to pay for a product that does something enormous teams of PhDs at BigTech haven't been able to pull off, purely on the basis of "trust me".