|
|
|
|
|
by minimaxir
1014 days ago
|
|
A robust 1.1B model compared to a 7B model would be strongly appreciated. The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100; dropping it by an order of magnitude and letting it run on other cloud GPUs will open new opportunities. |
|
?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users.
I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM.
I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar.