by FanaHOVA 863 days ago
Accelerators have nothing to do with it as we're mostly memory bound by HBM <> SRAM data transfer rather than compute bound.

It depends. Right now once we hit 6-8 bit precision inference, H100s/A100s are not memory-bound, but compute-bound.
This is wrong. Whether you are memory bound or not depends on the dimensions of the matrices being multiplied (if you’re on tensor cores). https://docs.nvidia.com/deeplearning/performance/dl-performa...

Some of the things being done to improve the quality of 6-8 bit inference use extra compute and push it a little in the other direction, but it’s still pretty memory-intensive until the batch size gets quite large.
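
To make the matrix-dimension argument concrete, here is a minimal roofline-style sketch. It compares the arithmetic intensity of a matrix multiply (FLOPs per byte moved) with the hardware's FLOPs-to-bandwidth ratio. The function names are hypothetical, the peak numbers are illustrative assumptions for an A100-class card rather than measurements, and the model assumes each matrix crosses the memory bus exactly once, which real kernels only approximate.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2.0):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes A, B and C each cross the memory bus exactly once
    (an idealisation; real kernels may re-read tiles).
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(m, n, k, peak_flops, peak_bandwidth, bytes_per_element=2.0):
    """True if the GEMM sits below the ridge point peak_flops / peak_bandwidth."""
    ridge = peak_flops / peak_bandwidth
    return gemm_arithmetic_intensity(m, n, k, bytes_per_element) < ridge


# Illustrative assumptions, not measurements: roughly an A100-class card
# with 16-bit weights and activations.
PEAK_FLOPS = 312e12      # FLOP/s
PEAK_BANDWIDTH = 1.5e12  # bytes/s

if __name__ == "__main__":
    hidden = 8192  # hypothetical square weight matrix, hidden x hidden
    for batch in (1, 8, 64, 512, 4096):
        ai = gemm_arithmetic_intensity(batch, hidden, hidden)
        bound = "memory" if is_memory_bound(
            batch, hidden, hidden, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
        print(f"batch={batch:5d}  intensity={ai:8.1f} FLOP/byte  -> {bound}-bound")
```

Under these assumptions a batch of 1 lands at about 1 FLOP per byte, far below the ridge point, and the multiply only crosses into compute-bound territory once the batch reaches the hundreds. Lower-precision weights change `bytes_per_element` and move that crossover.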

It'll help, but the GPU crunch is caused not by people running 6-8 bit inference on a single card but by all the large-scale pre-training and fine-tuning runs.
Can you link to an actual performance analysis on this?
Easy. I ran tests on a desktop Core i7-7700 with 64 GB of DDR4-2400, using 13B, 30B, and 70B models. As you can imagine, it is easy to control how many CPU cores are used.

The answer: it really works, but slowly (about 0.5 to 1 tokens per second, at nearly 100% CPU usage).

The i7-7700 is a well-balanced machine, but I have hit memory bandwidth limits a few times before with highly optimized software, and that looks very different: even with all cores in use, CPU usage was only around 50%.
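
For anyone who wants to compare numbers like these against a bandwidth ceiling, here is a rough sketch. It assumes every weight is read from RAM once per generated token, that the DDR4-2400 runs in dual-channel mode, and a given number of bits per weight. The quantization level is not stated above, so the 4-bit figure is purely an assumption, and the function names are hypothetical.

```python
def ddr_peak_bandwidth(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in bytes/s (64-bit bus per channel)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels


def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed if every weight is streamed once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    bandwidth = ddr_peak_bandwidth(2400)  # DDR4-2400, dual channel assumed
    bits = 4                              # assumed quantization, not from the post
    for n_params in (13e9, 30e9, 70e9):
        ceiling = max_tokens_per_second(n_params, bits, bandwidth)
        print(f"{n_params / 1e9:.0f}B @ {bits}-bit: at most {ceiling:.2f} tokens/s")
```

This is only a theoretical ceiling. Sustained bandwidth is lower than the peak figure, and a different quantization shifts the result proportionally, so treat it as a sanity check rather than a verdict on what the bottleneck is.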

BTW, llama.cpp is very good software.

If I’m not mistaken, for parallel inference requests and for prompt preprocessing it’s compute bound.

Also, if you have just a single model you want to optimise (for inference, not training), you could build an array of ASICs that each do specific matrix computations; then you don’t need to read weights from memory at all.