Hacker News
by ActorNightly 3 days ago
Very false.

I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller-parameter models plain suck and are still slow on MacBooks. And affordability of higher-end hardware is very limited.

Even at non-VC-subsidized $/token prices, it's still much cheaper to run cloud-based models.

3 comments

> Even at non-VC-subsidized $/token prices, it's still much cheaper to run cloud-based models.

On a price-per-watt level, this is not true; people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are ~80%+ of the way there, and often converge to a similar solution after a few dead ends.

[1] https://www.reddit.com/r/LocalLLM/comments/1kshq4f/electrici...

Maybe not per watt, but unless you already happen to own the 3090 cited in that post, you'd have to buy one as well, and they're currently selling for around $1400 used.
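For anyone who wants to plug in their own numbers, here is a rough back-of-the-envelope sketch of that trade-off: electricity cost per million tokens on local hardware, and how many tokens it takes before the card pays for itself against a cloud price. Only the $1400 figure comes from this thread. The function names, the wattage, the tokens/sec, the electricity price and the cloud $/Mtok are all hypothetical placeholders, and the model ignores idle power, resale value and quality differences.

```python
JOULES_PER_KWH = 3_600_000


def local_electricity_usd_per_mtok(watts, tokens_per_sec, usd_per_kwh):
    """Electricity cost (USD) to generate one million tokens locally."""
    seconds_per_mtok = 1_000_000 / tokens_per_sec
    kwh_per_mtok = watts * seconds_per_mtok / JOULES_PER_KWH
    return kwh_per_mtok * usd_per_kwh


def breakeven_tokens(hardware_usd, cloud_usd_per_mtok, watts, tokens_per_sec, usd_per_kwh):
    """Tokens needed before the hardware pays for itself; None if it never does."""
    local = local_electricity_usd_per_mtok(watts, tokens_per_sec, usd_per_kwh)
    saving_per_mtok = cloud_usd_per_mtok - local
    if saving_per_mtok <= 0:
        # Cloud is cheaper per token even before counting the hardware.
        return None
    return hardware_usd / saving_per_mtok * 1_000_000


if __name__ == "__main__":
    # $1400 is the used price quoted above; everything else is an assumed placeholder.
    tokens = breakeven_tokens(
        hardware_usd=1400,
        cloud_usd_per_mtok=1.00,  # assumed cloud price
        watts=350,                # assumed draw under load
        tokens_per_sec=50,        # assumed local throughput
        usd_per_kwh=0.15,         # assumed electricity price
    )
    print("never" if tokens is None else f"{tokens / 1e6:.0f}M tokens to break even")
```

The point of the sketch is that both sides of the argument can be right: per-token electricity can be cheap while the up-front hardware cost still dominates unless you push a lot of tokens through the card.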
3090s are running $1400 now? Wowsers. I thought I was overspending when I bought 6x of them for around $800 a pop.

Might be time to sell, to be honest. It's fun to have that at home, but I can't justify having $10k (with memory, mobo, cpu, etc) sitting in my basement without being fully utilized.

I'll take two of them. A thousand apiece.
I do have a 3090 Ti in my gaming PC, but even my old M1 MBP (with a mere 32 GB of RAM) is quite competent and can run a quantized `Gemma4-26B-A4B` in the background while I do other stuff.
The MBP running Gemma4 is absolutely useless for any real work.
What is "real work"?
Where you are developing software. It's significantly faster to use Google Gemini and copy-paste code back and forth than to have Gemini edit files for you.
To be fair, I can also use that 3090 for other things locally. Not just AI.
Well, to be fair, that's right now. I think the question is: what about in 6 months, 12 months, 2 years?

Where do these improvement curves go? Does the gap close, or do they intersect for practical purposes (factoring in cost etc.)? Or is the local curve always just a translation of the hosted one, lagging behind? Or does hosted pull further ahead?

Nobody knows, but I feel it's a very open question, and the answer might quite reasonably be that yes, they intersect on that kind of short-ish time horizon.

>Where do these improvement curves go?

Nowhere.

Large models haven't seen that much improvement, just better performance on small, specific tasks, which is all special-cased RL to game metrics.

For local models, it's the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on average. Qwen also boasts that it's better, because again, they RLed it to game some metrics. It's better at coding than Gemma, but Gemma is better at more creative thinking (again, all RL).

Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.

The only way to improve local models is by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes, like what all companies do with search, you want users to go to your service, which means running on your servers. And it's not really that much extra per invocation (i.e., excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small model.

You just need a decent Mac Studio; they are plentiful used and affordable.