| HN Mirror



	by FanaHOVA 863 days ago
	Accelerators have nothing to do with it as we're mostly memory bound by HBM <> SRAM data transfer rather than compute bound.


rawrawrawrr 863 days ago

It depends. Right now once we hit 6-8 bit precision inference, H100s/A100s are not memory-bound, but compute-bound.


chessgecko 863 days ago

This is wrong: whether you're memory bound has to do with the dimensions of the matrices being multiplied (if you’re on tensor cores). https://docs.nvidia.com/deeplearning/performance/dl-performa...

Some of the things being done to improve the quality of 6-8 bit inference use extra compute and push it a little in the other direction, but it’s still pretty memory-intensive until the batch size gets quite large.
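To make the matrix-dimension argument concrete, here is a rough roofline-style sketch. It compares the arithmetic intensity (FLOPs per byte moved) of a matrix multiply against the hardware's ratio of peak compute to memory bandwidth. The function names, the 4096-wide layer, and the peak numbers below are illustrative assumptions (round figures, not vendor specs), and the model ignores caching and kernel overheads.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    # Read A (m x k) and B (k x n), write C (m x n), each exactly once.
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(m, n, k, peak_flops, mem_bandwidth, bytes_per_element=2):
    """True if the multiply sits below the hardware's FLOPs-per-byte ridge."""
    ridge = peak_flops / mem_bandwidth
    return gemm_arithmetic_intensity(m, n, k, bytes_per_element) < ridge


# Hypothetical accelerator: round numbers chosen for illustration only.
PEAK_FLOPS = 300e12      # assumed 300 TFLOP/s of tensor-core compute
MEM_BANDWIDTH = 2e12     # assumed 2 TB/s of HBM bandwidth

if __name__ == "__main__":
    hidden = 4096  # assumed layer width
    for batch in (1, 16, 256, 4096):
        ai = gemm_arithmetic_intensity(batch, hidden, hidden)
        bound = "memory" if is_memory_bound(
            batch, hidden, hidden, PEAK_FLOPS, MEM_BANDWIDTH) else "compute"
        print(f"batch={batch:5d}  intensity={ai:8.1f} FLOP/byte  -> {bound}-bound")
```

Under these assumptions a batch of 1 lands near 1 FLOP/byte, far below the ridge of 150, while a batch of 4096 lands above 1000. Passing `bytes_per_element=1` shows how lower precision raises the intensity for the same shapes.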


FanaHOVA 863 days ago

It'll help, but the GPU crunch isn't caused by people running 6-8 bit inference on a single card; it's caused by all the large-scale pre-training and fine-tuning runs.


yazzku 863 days ago

Can you link to an actual performance analysis on this?


simne 862 days ago

Easy. I ran tests on a desktop Core i7-7700 with 64 GB of DDR4-2400. I tested 13B, 30B, and 70B models on it, and as you can imagine, it's easy to control how many CPU cores are used.

The answer: it really works, but slowly (about 0.5 to 1 tokens per second, at nearly 100% CPU usage).

The i7-7700 is a well-balanced machine, but I have hit memory bandwidth limits on it a few times before with highly optimized software, and that looks very different: when using all cores, I got only about 50% CPU usage.

BTW llama.cpp is very good software.
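For comparison with the observed rate, a back-of-the-envelope bandwidth ceiling can be sketched as follows. It assumes every weight is read from RAM once per generated token, dual-channel memory, and 4-bit quantized weights; none of these are stated in the comment, so treat the output as a rough upper bound rather than a measurement.

```python
def ddr_bandwidth_bytes_per_s(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth: transfers/s x 8-byte bus x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels


def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed if each token streams all weights once."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # DDR4-2400, assumed dual channel: 2400 MT/s * 8 B * 2 = 38.4 GB/s peak.
    bandwidth = ddr_bandwidth_bytes_per_s(2400)
    for n_params in (13e9, 30e9, 70e9):
        # 4 bits per weight is an assumption; the comment does not say.
        ceiling = max_tokens_per_second(n_params, 4, bandwidth)
        print(f"{n_params / 1e9:.0f}B params: at most ~{ceiling:.1f} tokens/s")
```

If measured throughput sits well below this ceiling while the CPU is pegged, that is consistent with being compute bound; if it sits close to the ceiling, memory bandwidth is the more likely limit.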


kolinko 863 days ago

If I’m not mistaken, for parallel inference requests and for prompt preprocessing it’s compute bound.

Also, if you have just a single model you want to optimise (and not the training), you could build an array of ASICs that do specific matrix computations; then you don’t need to read weights from memory at all.
