Hacker News
by simon_vtr 379 days ago
It doesn’t. The batch size is just 8. This is a very good trick and is often needed to achieve peak performance in memory-bound kernels. You can check out the equivalent code in CUDA as well :)