Hacker News
by sailingparrot 252 days ago
For kernel-level performance tuning, you can use the occupancy calculator, as jplusqualt pointed out, or you can profile your kernel with Nsight Compute, which will give you a ton of info.

But for model-wide performance, you basically have to come up with your own calculation: estimate the FLOPs your model requires per step, then compare the FLOP/s you actually achieve against the GPU's peak to see how well you are maxing out the hardware (MFU/HFU, i.e. model/hardware FLOPs utilization).
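To make that concrete, here is a rough sketch of the kind of back-of-the-envelope calculation I mean. It assumes a dense transformer and the common approximation of about 6 FLOPs per parameter per token for training (2 for the forward pass, 4 for the backward pass), which ignores the attention term. For HFU it assumes full activation recomputation, i.e. one extra forward pass. The function names and the throughput number are made up for illustration; the 312 TFLOP/s figure is the dense BF16 peak of an A100.

```python
def training_flops_per_token(n_params: float, recompute_forward: bool = False) -> float:
    """Approximate training FLOPs per token for a dense transformer.

    Assumption: 2 FLOPs/param/token forward + 4 backward = 6 total.
    With full activation recomputation the forward pass runs twice, so 8.
    The attention FLOPs are ignored, which is reasonable for short sequences.
    """
    factor = 8.0 if recompute_forward else 6.0
    return factor * n_params


def utilization(n_params: float,
                tokens_per_second: float,
                peak_flops_per_second: float,
                recompute_forward: bool = False) -> float:
    """Fraction of peak FLOP/s achieved on one GPU.

    recompute_forward=False gives MFU (only the FLOPs the model needs).
    recompute_forward=True gives HFU (also counts recomputed FLOPs).
    """
    achieved = training_flops_per_token(n_params, recompute_forward) * tokens_per_second
    return achieved / peak_flops_per_second


# Hypothetical numbers: a 7B-parameter model at 3000 tokens/s per GPU.
N_PARAMS = 7e9
TOKENS_PER_SECOND_PER_GPU = 3000.0
PEAK_FLOPS = 312e12  # A100 dense BF16 peak

mfu = utilization(N_PARAMS, TOKENS_PER_SECOND_PER_GPU, PEAK_FLOPS)
hfu = utilization(N_PARAMS, TOKENS_PER_SECOND_PER_GPU, PEAK_FLOPS, recompute_forward=True)
print(f"MFU: {mfu:.1%}  HFU: {hfu:.1%}")
```

Measure tokens/s per GPU from your own training loop (global tokens per step, divided by step time and GPU count) and plug it in. If you use selective rather than full recomputation, the HFU factor lands somewhere between 6 and 8.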

Here is a more in-depth example of how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...