by liuliu 180 days ago
I hope you finish this one though. It starts strong (I particularly liked how you looked into ncu and showed what each recommendation means; this is very helpful for beginners), but the ending is unsatisfying. You didn't explore tensor cores (particularly fp16 / tf32 / bf16), swizzling (which is the right way to solve the K transpose issue, especially given that Triton itself provides a few ways to do this), or async loading (pipelining).

Do you have trouble getting access to an H100 or similar chips? Wondering if there is anything that could help you finish this write-up.

1 comment

Hi, thanks a lot for the feedback! I'm glad you enjoyed the profiling sections.

You've hit the nail on the head regarding the missing pieces. I actually hit a bit of a wall with my current hardware; using an RTX 2070 made it difficult to meaningfully explore the async loading (TMA) and pipelining optimizations that were used in FA3 and FA4. I also felt the write-up was already pushing the limits of a single post's length, so I decided to "ship it" as a first part.

I would love to dive into TMA for Part 2. Getting my hands on an H100 (or even an A100) would be highly appreciated on my end! If you have any leads on hardware access, please let me know. I’d love to finish the story!