| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by daft_pink 364 days ago

isn’t the problem with the benchmarks that most people running ai locally are running way lower weights?

i have an m4 studio with a lot of unified memory and i’m still no where near running a 120b model. i’m at like 30b

apple or nvidia’s going to have to sell 1.5 tb ram machines before benchmark performance is going to be comparable

Plus when you use claude or openai, these days it’s performing google searches etc that my local model isn’t doing.

3 comments

BoorishBears 364 days ago

No, I've deployed a lot of open weight models and the gap between closed source is there even at larger sizes.

I'm running a 400B parameter model at FP8 and it still took a lot of post-training to get an even somewhat comparable performance

I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.

GPT-3.5 has better world knowledge than some 70B models, and a few even larger.

link

laardaninst 363 days ago

The big "frontier" models are expert systems built on top of the LLM. That's the reason for the massive payouts to scientists. It's not about some ML secret sauce, it's about all the symbolic logic they bring to the table.

Without constantly refreshing the underlying LLM and the expert system layer, these models would be outdated in months. Language and underlying reality would shift from under their representations and they would rot quick.

That's my reasoning for considering this a bubble. There has been zero indication that the R&D can be frozen. They are stuck burning increasing amouts of cash for as long as they want these models to be relevant and useful.

link

daft_pink 364 days ago

you're killing my dream of blowing $50-100k on a desktop supercomputer next year and being able to do everything locally ;)

"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.

link

meaydinli 364 days ago

Take a look at: https://www.nvidia.com/en-us/products/workstations/dgx-spark... . IIRC, it was about ~$4K.

link

PeterStuer 363 days ago

Given that for a non quantized 700B monolithic model with let's say a 1M token context, you would need around 20TB of memory, I doubt your spark or M4 will get very far.

I'm not saying those machines can't be usefull or fun, but it's not in the range of the 'fantasy' thing you're responding to.

link

daft_pink 363 days ago

I regularly use Gemini CLI and Claude Code, and I'm convinced that Gemini's enormous context window isn't that helpful in many situations. I think the more you put into context, the more likely the model is to go off into on a tangent and you end up with "context rot" or get confused and start working on an older no longer relevant context. You definitely need to manage and clear your context window and the only time I would want such a large context window is when the source data is really that large.

link

PeterStuer 362 days ago

Context quality and relevance is indeed a major factor. But large size is not the core issue, although in unmaintained or poor relevance context situations a smaller window is going to blissfully forget the bad, and the good, sooner.

link

phonon 364 days ago

An M4 Max twice the memory bandwidth (which is typically the limiting factor)

link

BoorishBears 364 days ago

I'll say neither of them will do anything for you if you're currently using SOTA closed models in anger and expect that performance to hold.

I'm on a 128GB M4 Max, and running models locally is a curiosity at best given the relative performance.

link

daft_pink 363 days ago

I'm running an M4 Max as well and I found that using project goose works decently well with qwen3 coder loaded on LM Studio (Ollama doesn't do MLX yet unless you build it yourself I think) and configured as an openai model as the api is compatible. Goose adds a bunch of tools and plugins that make the model more effective.

link

phonon 364 days ago

It will be sort of decent on a 4bit 70B parameter model, like here https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M). But yeah, not great.

link

granitepail 364 days ago

In my case, I’m paying for inference on the original models from e.g. Fireworks. So it’s not a quantization problem. The Qwen3 I was using was the new 458B (i think that’s the size?) model that was their top performer for code.

I agree with other comments that there are productive uses for them. Just not on the scale of o4-mini/o3/claude 4 sonnet/opus.

So imo open weights larger models from big US labs is a big deal! Glad to see it. Gemma models, for example, are great for their size. They’re just quite small.

link

refulgentis 364 days ago

I'm so darn confused on local LLMs and M-series inference speed, the perf jump from M2 Max to M4 Max was negligible, 10-20%. (both times MBP, 64 GB and max gpu cores)

link

PeterStuer 363 days ago

Does your inference framework target the NPU or just GPU/CPU?

link

refulgentis 363 days ago

It's linking llama.cpp and using Metal, so I presume GPU/CPU only.

I'm more than a bit overwhelmed with what I've gotten on my plate and have completely missed the boat on ex. understanding what MLX is, really curious for a thought dump if you have some opinionated experience/thoughts here. (ex. never crossed my mind until now that you might get better results on the NPU than GPU)

link

PeterStuer 362 days ago

LMstudio seems to have MLX support on Apple silicon so you could quickly have a feel for whether it helps in your case https://github.com/lmstudio-ai/mlx-engine

link