Hacker News
by chessgecko 865 days ago
This is wrong. Whether you're memory bound or not depends on the dimensions of the matrices being multiplied (if you're on tensor cores). https://docs.nvidia.com/deeplearning/performance/dl-performa...

Some of the things being done to improve the quality of 6-8 bit inference use extra compute and push it a little in the other direction, but it's still pretty memory intensive until the batch size gets quite large.
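
To make the batch size point concrete, here's a rough back-of-the-envelope sketch. It compares the arithmetic intensity (FLOPs per byte moved) of a matmul against a GPU's ratio of peak math throughput to memory bandwidth. The function names, the 4096 x 4096 weight matrix, and the peak numbers are all illustrative assumptions on my part, not measurements, and it ignores caching, quantization overhead, and kernel details.

```python
def arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul.

    Assumes each input and the output is read or written exactly once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(m, n, k, peak_flops, peak_bandwidth, bytes_per_element=2):
    """True if the matmul's intensity is below the hardware's FLOPs:byte ratio."""
    hardware_ratio = peak_flops / peak_bandwidth
    return arithmetic_intensity(m, n, k, bytes_per_element) < hardware_ratio


# Assumed, illustrative hardware numbers (not tied to any specific GPU here).
PEAK_FLOPS = 125e12      # FLOP/s
PEAK_BANDWIDTH = 900e9   # bytes/s

# m plays the role of the batch size; the weights are an assumed 4096 x 4096.
for batch in (1, 8, 64, 512, 4096):
    intensity = arithmetic_intensity(batch, 4096, 4096)
    bound = is_memory_bound(batch, 4096, 4096, PEAK_FLOPS, PEAK_BANDWIDTH)
    label = "memory bound" if bound else "compute bound"
    print(f"batch={batch:5d}  intensity={intensity:8.1f} FLOPs/byte  {label}")
```

With these assumed numbers, small batches sit well below the hardware ratio and only the larger ones cross over, which is the effect described above. Dropping `bytes_per_element` below 2 for lower-precision weights raises the intensity somewhat, but it doesn't change the picture at batch size 1.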