|
|
|
|
|
by mixermachine
66 days ago
|
|
The cutting edge, max size models will likely stay in the GPU space for a long time.
But these models are not needed for most general requests.
With a fine tuned 30B quantisized model you can serve a large portion of requests with around 32GB of RAM.
Free users will likely only get these kinds of models. At some point we will get these models in hardware and the cost per token will be minimal. |
|
These are exactly the kinds of models that you can easily run locally by repurposing existing hardware. Depending on how much you're willing to wait for the answer, running local even gives you strictly better outcomes for simple Q&A queries.
(Long-context and agentic use cases are admittedly much harder to fit under that model, since non-AI uses for the high-end hardware you'd realistically need for those are rather more limited, and they're hit by the ongoing hardware shortage.)