| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by germanjoey 441 days ago
	TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! I mean, it depends greatly on the kernel itself and the target arch, but I'm also assuming that the 2x-4x number is their best case scenario. Whereas the best case for hand-optimized could be in the tens or even hundreds of X.

2 comments

godelski 441 days ago

I'm a bit confused. It sounds like you are disagreeing ("TBH") but the content seems like a summary of my comment. So, I agree.

Fwiw, they did say they got up to 20x improvement but given the issues we both mention this may not be surprising given that this seems to be an outlier by their own omission.

link

jaberjaber23 441 days ago

absolutely. it really depends on the kernel type, target architecture, and what you're optimizing for. the 2x-4x isn’t the limit, it's just what users often see out of the box. we do real-time profiling on actual GPUs, so you get results based on real performance on a specific arch, not guesses. when the baseline is rough, we’ve seen well over 10x

link