Hacker News
by mixermachine 66 days ago
The cutting-edge, maximum-size models will likely stay on GPUs for a long time. But most general requests don't need those models. With a fine-tuned 30B quantized model you can serve a large portion of requests with around 32GB of RAM. Free users will likely only get these kinds of models.

At some point we will get these models in hardware and the cost per token will be minimal.
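
For anyone who wants to sanity-check the 32GB figure, here is a rough back-of-envelope sketch. It only counts the memory for the weights and then pads it with a flat overhead for the KV cache and runtime buffers. The 20% overhead and the function names are my own assumptions for illustration, not measured values; real usage depends on context length and the inference runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_ram(params_billion: float, bits_per_weight: float,
                ram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Check whether weights plus a flat overhead fit in the given RAM.

    overhead_fraction is an assumed allowance for KV cache and runtime
    buffers; 0.2 is a placeholder, not a measured number.
    """
    needed_gb = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= ram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(30, bits)
        ok = fits_in_ram(30, bits, ram_gb=32)
        print(f"30B at {bits}-bit: ~{size:.0f} GB of weights, fits in 32 GB: {ok}")
```

Under these assumptions a 30B model only fits in 32GB once it is quantized to roughly 4 bits per weight; at 8 bits the weights alone already take about 30GB, which leaves almost no room for anything else.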

1 comments

> With a fine-tuned 30B quantized model you can serve a large portion of requests with around 32GB of RAM. Free users will likely only get these kinds of models.

These are exactly the kinds of models that you can easily run locally by repurposing existing hardware. Depending on how long you're willing to wait for the answer, running locally even gives you strictly better outcomes for simple Q&A queries.

(Long-context and agentic use cases are admittedly much harder to fit under that model, since non-AI uses for the high-end hardware you'd realistically need for those are rather more limited, and they're hit by the ongoing hardware shortage.)

For programmers, maybe. I do this too. But think about all the regular users out there: your dad and your mum, maybe even your grandparents. This is a huge market too, and for that we can use these special chips at scale.