by minimaxir 1014 days ago
A robust 1.1B model comparable to a 7B model would be strongly appreciated. The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100; dropping it by an order of magnitude and letting it run on other cloud GPUs will open new opportunities.

> The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100

?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users.

I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM.

I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar.
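The memory arithmetic behind running quantized models on small GPUs can be sketched roughly as follows. This is a minimal estimate of weight size only: it ignores the KV cache, activations, and runtime overhead, and real quantization formats usually spend somewhat more than the nominal bits per weight. The function names and the "gpu" / "split" / "no" labels are hypothetical, made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def placement(n_params: float, bits_per_weight: float,
              vram_gb: float, ram_gb: float = 0.0) -> str:
    """Say where the weights could live, ignoring KV cache and overhead.

    "gpu"   -> weights fit entirely in VRAM
    "split" -> weights fit only if some layers stay in CPU RAM
    "no"    -> weights do not fit in VRAM + RAM combined
    """
    need = weight_footprint_gb(n_params, bits_per_weight)
    if need <= vram_gb:
        return "gpu"
    if need <= vram_gb + ram_gb:
        return "split"
    return "no"


if __name__ == "__main__":
    # Laptop from the comment above: 6 GB of VRAM, 16 GB of CPU RAM.
    # 4 bits per weight is an assumed, nominal quantization level.
    for name, n in [("7B", 7e9), ("13B", 13e9)]:
        for bits in (16, 4):
            gb = weight_footprint_gb(n, bits)
            print(f"{name} @ {bits}-bit: {gb:.1f} GB -> {placement(n, bits, 6, 16)}")
```

Under these assumptions a 4-bit 13B model needs about 6.5 GB for weights, so on a 6 GB card it only works with part of the model kept in CPU RAM, while a 4-bit 7B model fits on the GPU outright.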

> ?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read,

That's the thing: you need a whole GPU per concurrent user, which is insanely expensive if you want to run it as part of a SaaS (which is what most for-profit companies want to do). Of course, running models locally is much better in almost every regard, but nobody is gonna be a billionaire with that…

Your point is anticipated by the next sentence in the comment you replied to:

"A somewhat bigger consumer GPU can batch it and serve dozens of users."

Did you not read it?

“Dozens” doesn't really change the economics here. A SaaS can serve a thousand concurrent users on hardware that costs about as much as a 4090, so we're still two orders of magnitude off compared to regular SaaS business models.

Sure it does:

- Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low.

- A 4090 amortized over 4 years, working days & hours, is 20 cents per working hour; this is basically the same as the power going into it. It's less than a penny per hour per concurrent on a task like this.

- Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it.

- If you hit massive scale and want to buy A100s to improve the economics because you're drowning in business, you can go ahead and readily do that at that time...
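The amortization arithmetic in the list above can be reproduced with a short sketch. The purchase price (1600 USD), the 260 working days per year, the 8 working hours per day, and the figure of 24 concurrent users standing in for "dozens" are all assumptions chosen for illustration; none of them comes from the comment itself.

```python
def amortized_cost_per_hour(price_usd: float, years: float,
                            days_per_year: float, hours_per_day: float) -> float:
    """Hardware price spread evenly over every working hour of its lifetime."""
    return price_usd / (years * days_per_year * hours_per_day)


# Assumed inputs (not from the thread): card price and working calendar.
GPU_PRICE_USD = 1600.0
YEARS = 4
WORKING_DAYS_PER_YEAR = 260
WORKING_HOURS_PER_DAY = 8
CONCURRENT_USERS = 24  # assumed stand-in for "dozens"

hourly = amortized_cost_per_hour(GPU_PRICE_USD, YEARS,
                                 WORKING_DAYS_PER_YEAR, WORKING_HOURS_PER_DAY)
per_user = hourly / CONCURRENT_USERS

if __name__ == "__main__":
    print(f"amortized: {hourly * 100:.1f} cents per working hour")
    print(f"per concurrent user: {per_user * 100:.2f} cents per hour")
```

With these inputs the card comes out at roughly 19 cents per working hour and under one cent per concurrent user per hour, which matches the figures quoted above. Power and hosting are left out.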

> A 4090 amortized over 4 years, working days & hours, is 20 cents per working hour;

But that's not how it works: you need to have enough of them to accommodate peak usage, and a good fraction of that capacity isn't going to be running most of the time. You'd end up with a cost that's not too far from what cloud providers are offering, which is roughly 3 times that price. And you need to pay for the whole server hosting these GPUs. This is less of a factor when you're using big GPUs like the H100, but if you want to stick with consumer-grade GPUs, the host is still a non-trivial fraction of the cost, and you're supporting a server for a small bunch of concurrent users, which means your infra team is going to be working with a massive pool of servers very quickly, with all the associated costs.
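The peak-provisioning argument can be put into numbers with a rough sketch. It assumes you buy enough GPUs for peak load and that average load is some fraction of peak; the 1/3 ratio and the 25% host overhead below are illustrative assumptions, not measurements.

```python
def cost_per_utilized_hour(amortized_hourly_usd: float,
                           avg_to_peak_ratio: float,
                           host_overhead: float = 0.0) -> float:
    """Effective cost of one GPU-hour that actually serves traffic.

    avg_to_peak_ratio: average load divided by peak load (0 < ratio <= 1).
    host_overhead: extra cost of the host server, as a fraction of GPU cost.
    """
    if not 0 < avg_to_peak_ratio <= 1:
        raise ValueError("avg_to_peak_ratio must be in (0, 1]")
    return amortized_hourly_usd * (1 + host_overhead) / avg_to_peak_ratio


if __name__ == "__main__":
    base = 0.20  # the 20 cents per working hour quoted above
    # Assumed: average load is one third of peak.
    print(f"{cost_per_utilized_hour(base, 1 / 3):.2f} USD")
    # Same, plus an assumed 25% overhead for the host machine.
    print(f"{cost_per_utilized_hour(base, 1 / 3, host_overhead=0.25):.2f} USD")
```

If average load is a third of peak, each useful GPU-hour costs three times the amortized figure, which is how the "roughly 3 times" estimate could come about.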

> It's less than a penny per hour per concurrent on a task like this.

It's still two orders of magnitude more expensive than any other SaaS business.

> Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it.

Maybe, but then again you're trying to build a service that has to add much more value than what the typical SaaS start-up provides.

Also regarding this:

> - Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low.

ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but:

- this market is already taken by them, so your startup isn't gonna do the same.

- when you start integrating LLMs in tools you use routinely (an IDE being the typical example), the amount of token generation skyrockets.

A single GPU with a batch size of 1 can serve many users, and higher batch sizes can serve many dozens. Pool a few and you can serve a sizable userbase.

It may not be super profitable, but it's not untenable either.

LLMs are GPU compute-bound. If you infer at batch_size = 1 on a model like Llama 2 7B on a "cheap" GPU like a T4 or an L4 it'll use about 100% of the compute, which means you get no benefit from batching.

The exception is the A100 GPU which does not use 100% of GPU compute and therefore you get benefit from batching, but is hella expensive.

The economics are not simple, and in most cases "just use the ChatGPT API" is also the most cost-effective option anyways. A smaller 1.1B model (which would likely not be compute-bound) with similar performance to a 7B model may tip the scales.

> LLMs are GPU compute-bound.

From what I understand, they are severely bandwidth bound at a GPU batch size of 1. Even llama.cpp is fairly RAM speed bound on a CPU with much less compute than a GPU.

It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...
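The bandwidth-bound claim can be illustrated with a back-of-the-envelope roofline sketch. It assumes each decoding step reads every weight once (shared across the whole batch) and costs about 2 FLOPs per parameter per generated token; it ignores KV cache traffic, attention cost, and kernel overhead. The bandwidth and compute figures are made-up round numbers for a hypothetical GPU, not the specs of any real card.

```python
def decode_tokens_per_second(n_params: float, bytes_per_weight: float,
                             mem_bw_gb_s: float, peak_tflops: float,
                             batch_size: int) -> float:
    """Upper bound on total generated tokens/s across the batch."""
    # Time to stream all weights from memory once per decoding step.
    t_memory = n_params * bytes_per_weight / (mem_bw_gb_s * 1e9)
    # Time to do the math: ~2 FLOPs per parameter for each sequence in the batch.
    t_compute = 2 * n_params * batch_size / (peak_tflops * 1e12)
    # The slower of the two limits the step; each step yields one token per sequence.
    return batch_size / max(t_memory, t_compute)


if __name__ == "__main__":
    # Hypothetical GPU: 300 GB/s memory bandwidth, 60 TFLOPS (assumed values).
    # Model: 7B parameters stored in 2 bytes each (fp16).
    for b in (1, 16, 400):
        tps = decode_tokens_per_second(7e9, 2, 300, 60, b)
        print(f"batch {b:>3}: up to {tps:,.0f} tokens/s in total")
```

Under these assumptions batch size 1 is limited by memory bandwidth, and total throughput grows almost linearly with batch size until the compute limit takes over. A scheduler that keeps the batch full is what turns that headroom into served users.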

> "cheap" GPU like a T4 or an L4 it'll use about 100% of the compute,

An LLM with batch_size=1 technically cannot use '100%' of the GPU, because it has to move a lot of data around and use different blocks of the GPU. When the tensor cores are in use, the CUDA cores are idle. Tensor cores are used for matrix multiplication, CUDA cores for activation functions (I'm simplifying). The model has to use both at different times, moving data between them. Meanwhile, the GPU monitor may report 100%, but it's still possible to insert another process. I think I've seen this idea in the PyTorch docs.

As for a 1.1B LLM, it would be nice. Interesting experiment anyway. I'm only afraid that with a big and diverse dataset the model will focus more on memorization, and generic logic may not emerge. They aren't doing anything new in terms of architecture and training methods.

That's still wildly too expensive if you want to make a profitable service that is scalable beyond VC capital injections.

1.1B with 3T tokens will never be comparable to 7B with 2T tokens.

And I'm not sure what you mean by inference latency being infeasible. Most people using these models at home don't even bother with the 7B and go straight to 13B because it's easy to run too and much smarter. And any cloud GPU can run 13B.