by littlestymaar
1016 days ago
> A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read

That's the thing: you need a whole GPU per concurrent user, which is insanely expensive if you want to run it as part of a SaaS (which is what most for-profits want to do). Of course running models locally is much better in almost every regard, but nobody is gonna be a billionaire with that…
|
"A somewhat bigger consumer GPU can batch it and serve dozens of users."
Did you not read it?