by yanniszark
595 days ago
I'm not sure, it seems to me like this should be doable on Nvidia as well. This is a paper that uses instruction sampling (PC sampling, exposed through CUPTI) on Nvidia GPUs to provide optimization advice: https://ieeexplore.ieee.org/document/9370339 It seems like the instruction sampler is there, and it also provides the stall reason.

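To make the "sampler plus stall reason" idea concrete, here is a rough sketch of the kind of summary you could build once you have the samples. It does not call CUPTI at all; the sample list, the stall-reason labels, and the `summarize` and `hottest` helpers are all made up for illustration, and a real tool would read the records from the profiler instead.

```python
from collections import Counter, defaultdict

# Hypothetical samples: (program counter offset, stall reason label).
# Neither the values nor the labels come from a real profiler run.
SAMPLES = [
    (0x10, "memory_dependency"),
    (0x10, "memory_dependency"),
    (0x10, "execution_dependency"),
    (0x28, "synchronization"),
    (0x28, "memory_dependency"),
    (0x40, "not_stalled"),
]


def summarize(samples):
    """Group samples by PC and count how often each stall reason appears."""
    per_pc = defaultdict(Counter)
    for pc, reason in samples:
        per_pc[pc][reason] += 1
    return per_pc


def hottest(per_pc, top=3):
    """Return (pc, total_samples, dominant_reason) rows, most-sampled first."""
    rows = [
        (pc, sum(counts.values()), counts.most_common(1)[0][0])
        for pc, counts in per_pc.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    for pc, total, reason in hottest(summarize(SAMPLES)):
        print(f"pc=0x{pc:x} samples={total} dominant_stall={reason}")
```

The point is only that a per-instruction histogram of stall reasons is cheap to compute on the host once the samples exist; the hard part is getting them off the device.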
A while ago, I read a paper that dissects Nvidia's Volta architecture using carefully tuned microbenchmarks to work out things like the on-chip cache structure [0]. Unfortunately, no one has done this for the recent architectures that are in serious use, so it's hard to apply this info today. Similarly, there isn't an eBPF VM running on the chip to summarize all of this, and the Nvidia tools aren't intended to make this kind of info easy to get, probably specifically because of this paper...
[0] https://arxiv.org/pdf/1804.06826