|
|
|
|
|
by brucethemoose2
1018 days ago
|
|
> The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100 ?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users. I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM. I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar. |
|
That's the thing: you need a whole GPU per concurrent user, this is insanely expensive if you want to run it as part of a SaaS (which is what most for-profit want to do). Of course running models locally is much better in almost every regard, but nobody is gonna be a billionaire with that…