| HN Mirror



	by ramesh31 1006 days ago
	> For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.

	You'll never get actual economics out of switching to open models without running your own hardware. That's the whole point. There are orders of magnitude of difference in price, where a single V100/3090 instance can run llama2-70b inference for ~$0.50/hr.


YetAnotherNick 1006 days ago

No, they can't run it. Llama 2 70B with 4-bit quantization takes ~50 GB of VRAM for a decent context size. You need an A100, 2-3 V100s, or four 3090s, all of which cost roughly $3-5/hr.


ramesh31 1006 days ago

Wrong. I am running 8-bit GGML with 24 GB of VRAM on a single 4090 with 2048 context right now.


YetAnotherNick 1006 days ago

Which model? I am talking about 70B, as clearly mentioned. 70B at 8-bit is 70 GB just for the model itself. How many tokens/second are you getting with a single 4090?
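The "70 GB just for the model itself" figure is plain arithmetic: parameter count times bits per weight. A minimal sketch follows, assuming a nominal 70e9 parameters and ignoring the KV cache, activations, and the per-block scale data that real quantized formats add on top; the function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "70B" is taken as exactly 70e9 parameters.
N_PARAMS = 70e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the 8-bit weights alone come to about 70 GB and the 4-bit weights to about 35 GB, so the ~50 GB quoted earlier for 4-bit would have to include context and other overhead.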


ramesh31 1006 days ago

Offloading 40% of layers to CPU, about 50 t/s with 16 threads.


pocketarc 1006 days ago

That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others reporting similar figures to mine so I assumed it was normal. Is there a secret to what you're doing?


ramesh31 1006 days ago

>Is there a secret to what you're doing?

Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.
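To see why memory bandwidth matters, here is a rough upper-bound sketch. It assumes token generation is memory-bandwidth bound and that every weight held in system RAM is read once per generated token. The 80 GB/s DDR5 figure is an illustrative assumption, not a measurement from this thread, and the function name is hypothetical.

```python
def max_tokens_per_second(bytes_read_per_token: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token must stream these bytes."""
    return bandwidth_bytes_per_s / bytes_read_per_token


weights_gb = 70.0          # 70B at 8-bit, as stated in the thread
cpu_fraction = 0.40        # share of layers offloaded to CPU, from the thread
ddr5_bandwidth_gbs = 80.0  # ASSUMPTION: rough dual-channel DDR5 bandwidth

# Bytes of CPU-resident weights that must be read for every token.
cpu_bytes_per_token = weights_gb * cpu_fraction * 1e9

bound = max_tokens_per_second(cpu_bytes_per_token, ddr5_bandwidth_gbs * 1e9)
print(f"CPU-side bound: {bound:.1f} tokens/s")
```

With these numbers the CPU-resident portion alone caps throughput at a few tokens per second, which is the same order of magnitude as the ~2 t/s reported earlier in the thread. A different bandwidth figure or a smaller model would shift the bound proportionally.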


jpdus 1006 days ago

Care to share your detailed stack and command to reach 50 t/s? I also have a 7950 with DDR5 and I don't even get 50 t/s on my two RTX 4090s...
