| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ohyes 1072 days ago
	What hardware do you have that lets you run 7b and do other stuff at the same time?

5 comments

brucethemoose2 1072 days ago

Pretty much any PC with 16GB+ of fast RAM can do this, any PC with a dGPU can do it well.

link

hmottestad 1072 days ago

Maybe a MacBook Pro. The Apple silicon chops can offload a special AI inference engine, and all ram is accessible by all parts of the chip.

link

gzer0 1072 days ago

An M1 Max with 64GB of RAM allows me to run multiple models simultaneously, on top of stable diffusion generating images non-stop + normal chrome, vscode, etc. Definitely feeling the heat, but it's working. Well worth the investment.

link

selfhoster11 1071 days ago

A 7B model at 8-bit quantization takes up 7 GB of RAM. Less if you use a 6-bit quantization, which is nearly as good. Otherwise it's just a question of having enough system RAM and CPU cores, plus maybe a small discrete GPU.

link

woadwarrior01 1071 days ago

You’ll need a bit more than 7GB (~1 GB or so), even at 8 bit quantization, because of the KV-cache. LLM inference is notoriously inefficient without it, because it’s autoregressive.

link

kkielhofner 1071 days ago

Some projects such as lmdeploy[0] can quantize the KV cache[1] as well to save some VRAM.

Speaking of lmdeploy, it doesn't seem to be widely known but it also supports quantization with AWQ[2] which appears to be superior to the more widely used GPTQ.

The serving backend is Nvidia Triton Inference Server. Not only is Triton extremely fast and efficient, they have a custom TurboMind backend for Triton. With this lmdeploy delivers the best performance I've seen[3].

On my development workstation with an RTX 4090, llama2-chat-13b, AWQ int4, and KV cache int8:

8 concurrent sessions (batch 1): 580 tokens/s

1 concurrent session (batch 1): 105 tokens/s

This is out of the box, I haven't spent any time further optimizing it.

[0] - https://github.com/InternLM/lmdeploy

[1] - https://github.com/InternLM/lmdeploy/blob/main/docs/en/kv_in...

[2] - https://github.com/InternLM/lmdeploy/tree/main#quantization

[3] - https://github.com/InternLM/lmdeploy/tree/main#performance

link

selfhoster11 1071 days ago

6-bit quantizations are supposed to be nearly equivalent to 8-bit, and that does chop 1.5 GB off the model size. I think a 6-bit model should therefore fit, or if that doesn't, 5-bit medium or 5-bit small surely will.

There is always an option to go down the list of available quantizations notch by notch until you find the largest model that works. llama.cpp offers a lot of flexibility in that regard.

link

FrozenSynapse 1071 days ago

how's the generation speed on CPU?

link

selfhoster11 1071 days ago

On Ryzen 5600X, 7B and 13B run quite fast. Off the top of my head, pure CPU performance is about 25% slower than with an NVIDIA GPU of some kind. I don't remember the numbers off the top of my head, but the generation speed only starts to get annoying for 33B+ models.

link

_joel 1072 days ago

If you're willing to sacrifice token/s you can even run these on your phone.

link