| HN Mirror

by moonchrome 1052 days ago

We are talking about 7B models? Those can run on consumer GPUs with lower latency than A100s AFAIK (because gaming GPUs are clocked differently).

Not to mention OpenAI has shit latency and terrible reliability. You should be using the Azure-hosted models if you care about that, though pricing is also higher.

I would say fixed costs and development time are on OpenAI's side, but I've seen people post great practical comparisons for latency and cost using hosted fine-tuned small models.

2 comments

minimaxir 1052 days ago

"Running" and "acceptable inference speed and quality" are two different constraints, particularly at scale/production.

moonchrome 1052 days ago

I don't understand what you're trying to say?

From what I've read, a 4090 should blow an A100 away if you can fit within 22GB of VRAM, which a 7B model should do comfortably.

And the latency (along with variability and availability) on the OpenAI API is terrible because of the load they are getting.
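
For anyone who wants to check latency and variability themselves rather than take either side's word for it, here is a minimal sketch of a timing harness. It assumes you wrap whatever you are comparing (a local model call, a hosted endpoint) in a zero-argument function; `measure_latency` and `fake_request` are hypothetical names for illustration, not part of any library.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 20) -> dict:
    """Time `call` repeatedly and summarise wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Simple rank-based p95; with few samples this is just the slowest run.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }


def fake_request() -> None:
    """Stand-in for a real inference call (assumption: replace with your own)."""
    time.sleep(0.01)


print(measure_latency(fake_request, runs=5))
```

The gap between the median and the p95/max is the "variability" part; a single average would hide it.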

7speter 1052 days ago

When you say it can run on consumer GPUs, do you mean pretty much just the 4090/3090, or can it run on lesser cards?

halflings 1052 days ago

I was able to run the 4-bit quantized Llama 2 7B on a 2070 Super, though latency was so-so.

I was surprised by how fast it runs on an M2 MBP + llama.cpp: way, way faster than ChatGPT, and that's not even using the Apple Neural Engine.

hereonout2 1052 days ago

It runs fantastically well on an M2 Mac + llama.cpp. A variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.

It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.

gsuuon 1052 days ago

Quantized 7B models can comfortably run with 8GB of VRAM.
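
The back-of-the-envelope arithmetic behind that, as a rough sketch: it counts weights only and assumes a nominal 7e9 parameters. Real quantized formats store a bit more than their headline bit width per weight (scales and similar metadata), and the KV cache and activations need room on top, so treat the numbers as a floor. `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7e9  # nominal "7B" (assumption; actual checkpoints differ slightly)

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

That works out to roughly 13 GiB at fp16 and a little over 3 GiB at 4 bits, which is why a 4-bit 7B fits on an 8GB card while the fp16 version wants a 3090/4090-class GPU.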