by refulgentis
335 days ago
The number of people who will be using it at 1 token/sec because there's no better option, and who have 64 GB of RAM, is vanishingly small. IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out to be an awful experience.

* The implication being that anything is possible in that scenario.
I will also point out that having three API-based providers deploying an impractically large open-weights model beats the pants off having just one. Back in the day, this was called second-sourcing, IIRC. With proprietary models, you're at the mercy of one corporation and its Kafkaesque ToS enforcement.