Hi, thanks a lot for the feedback! I'm glad you enjoyed the profiling sections.

You've hit the nail on the head regarding the missing pieces. I actually hit a bit of a wall with my current hardware; using an RTX 2070 made it difficult to meaningfully explore the async loading (TMA) and pipelining optimizations that were used in FA3 and FA4. I also felt the write-up was already pushing the limits of a single post's length, so I decided to "ship it" as a first part.

I would love to dive into TMA for Part 2 if I can get my hands on an H100 (or even an A100). If you have any leads on hardware access, please let me know; they would be highly appreciated on my end, and I'd love to finish the story!