by ddelnano 231 days ago
Wouldn't the Nsight Systems suite provide coverage here? Are the tricky cases difficult to debug with the standard CUDA tooling stack?

Yes, nsys is very helpful, especially when looking at perf issues. Bugs often present the way they do in this blog post, though: you just notice that training curves have regressed somehow, so even with good tooling it can be hard to figure out where to start looking in these very complex systems. It only gets worse if the symptoms show up only when running for a long time and at scale in a cluster.