| HN Mirror


	by granitepail 364 days ago
	While the benchmarks all say open source models like Kimi and Qwen outpace proprietary models like GPT 4.1, GPT 4o, or even o3, my (and just about everyone I know's) boots-on-the-ground experience suggests they're not even close. This is for tool-calling agentic tasks, like coding, but also in other contexts (research, glue between services, etc). I feel like it's worth putting that out there--it's pretty clear there's a lot of benchmark hacking happening. I'm not really convinced it's purposeful/deceitful, but it's definitely happening. Qwen3 Coder, for example, is basically incompetent for any real coding tasks and frequently gets caught in death spirals of bad tool calls. I try all the OSS models regularly, because I'm really excited for them to get better. Right now Kimi K2 is the most usable one, and I'd rate it at a few ticks worse than GPT 4.1.

5 comments

daft_pink 364 days ago

isn’t the problem with the benchmarks that most people running ai locally are running much smaller (or heavily quantized) models?

i have an m4 studio with a lot of unified memory and i’m still nowhere near running a 120b model. i’m at like 30b

apple or nvidia is going to have to sell machines with 1.5 TB of RAM before benchmark performance is going to be comparable

Plus when you use claude or openai, these days it’s performing google searches etc that my local model isn’t doing.

BoorishBears 364 days ago

No, I've deployed a lot of open weight models and the gap with closed source models is there even at larger sizes.

I'm running a 400B parameter model at FP8 and it still took a lot of post-training to get even somewhat comparable performance.

I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.

GPT-3.5 has better world knowledge than some 70B models, and a few even larger.

laardaninst 363 days ago

The big "frontier" models are expert systems built on top of the LLM. That's the reason for the massive payouts to scientists. It's not about some ML secret sauce, it's about all the symbolic logic they bring to the table.

Without constantly refreshing the underlying LLM and the expert system layer, these models would be outdated in months. Language and underlying reality would shift from under their representations and they would rot quick.

That's my reasoning for considering this a bubble. There has been zero indication that the R&D can be frozen. They are stuck burning increasing amounts of cash for as long as they want these models to be relevant and useful.

daft_pink 364 days ago

you're killing my dream of blowing $50-100k on a desktop supercomputer next year and being able to do everything locally ;)

"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.

meaydinli 364 days ago

Take a look at: https://www.nvidia.com/en-us/products/workstations/dgx-spark... . IIRC, it was about $4K.

PeterStuer 363 days ago

Given that a non-quantized 700B monolithic model with, let's say, a 1M-token context would need around 20TB of memory, I doubt your Spark or M4 will get very far.

I'm not saying those machines can't be useful or fun, but they're not in the range of the 'fantasy' thing you're responding to.
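
For anyone who wants to play with that kind of estimate, here is a rough back-of-envelope sketch. It assumes a dense model stored at 16 bits per parameter and a plain multi-head attention KV cache; the layer count, head count and head size are made-up illustrative values, not the shape of any real model, and the sketch does not try to reproduce the 20TB figure exactly.

```python
# Back-of-envelope memory estimate for serving a large dense transformer.
# All shape numbers below are hypothetical; they are not from any real model.

def weights_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int) -> int:
    """Key/value cache for one sequence (the factor 2 is keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


TB = 10**12

# Assumed: 700B parameters at 2 bytes each (FP16/BF16).
weights = weights_bytes(700 * 10**9, 2)
# Assumed: 120 layers, 128 KV heads of size 128, 1M-token context, 2 bytes per value.
kv = kv_cache_bytes(120, 128, 128, 1_000_000, 2)

print(f"weights:  {weights / TB:.2f} TB")
print(f"kv cache: {kv / TB:.2f} TB")
print(f"total:    {(weights + kv) / TB:.2f} TB")
```

Under these assumptions the weights alone are about 1.4 TB and the cache adds several more terabytes. The result swings a lot with the attention layout: dropping to 8 KV heads (grouped-query attention) shrinks the cache by 16x, while 32-bit storage doubles everything.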

daft_pink 363 days ago

I regularly use Gemini CLI and Claude Code, and I'm convinced that Gemini's enormous context window isn't that helpful in many situations. I think the more you put into context, the more likely the model is to go off on a tangent, and you end up with "context rot", or it gets confused and starts working from older, no-longer-relevant context. You definitely need to manage and clear your context window, and the only time I would want such a large context window is when the source data really is that large.

phonon 364 days ago

An M4 Max has twice the memory bandwidth (which is typically the limiting factor)
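
A minimal sketch of why bandwidth matters, assuming single-stream decoding of a dense model where every generated token has to read all the weights from memory once. The bandwidth figures (273 GB/s and 546 GB/s) and the 35 GB model size are assumed values for illustration, not numbers from this thread, and the result is an upper bound rather than a measured speed.

```python
def max_decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s when decoding is limited by reading the weights."""
    return bandwidth_gb_s / model_size_gb


# Assumed: roughly a 70B model quantized to 4 bits per weight.
MODEL_SIZE_GB = 35.0

# Assumed memory bandwidths in GB/s, for illustration only.
machines = {"machine A": 273.0, "machine B": 546.0}

for name, bandwidth in machines.items():
    rate = max_decode_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most {rate:.1f} tokens/s")
```

Doubling the bandwidth doubles the ceiling, regardless of how much compute the chip has. Prompt processing is a different story, since that part tends to be compute-bound.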

BoorishBears 364 days ago

I'll say neither of them will do anything for you if you're currently using SOTA closed models in anger and expect that performance to hold.

I'm on a 128GB M4 Max, and running models locally is a curiosity at best given the relative performance.

granitepail 364 days ago

In my case, I’m paying for inference on the original models from e.g. Fireworks. So it’s not a quantization problem. The Qwen3 I was using was the new 458B (i think that’s the size?) model that was their top performer for code.

I agree with other comments that there are productive uses for them. Just not on the scale of o4-mini/o3/claude 4 sonnet/opus.

So imo larger open-weights models from big US labs are a big deal! Glad to see it. Gemma models, for example, are great for their size. They’re just quite small.

refulgentis 364 days ago

I'm so darn confused about local LLMs and M-series inference speed: the perf jump from M2 Max to M4 Max was negligible, 10-20%. (Both times an MBP with 64 GB and max GPU cores.)

PeterStuer 363 days ago

Does your inference framework target the NPU or just GPU/CPU?

refulgentis 363 days ago

It's linking llama.cpp and using Metal, so I presume GPU/CPU only.

I'm more than a bit overwhelmed with what I've got on my plate and have completely missed the boat on, e.g., understanding what MLX is. Really curious for a thought dump if you have some opinionated experience/thoughts here. (E.g., it never crossed my mind until now that you might get better results on the NPU than the GPU.)

PeterStuer 362 days ago

LM Studio seems to have MLX support on Apple silicon, so you could quickly get a feel for whether it helps in your case: https://github.com/lmstudio-ai/mlx-engine

n_kr 364 days ago

It may be the way I use it, but qwen3-coder (30b with ollama) is actually helping me with real world tasks. It's a bit worse than big models for the way I use it, but absolutely useful. I do use ai tools with very specific instructions though, like file paths, line numbers if I can, and specific direction about what to do, my own tools, etc. so that may be why I don't see such a huge difference from big models.

I should try Kimi K2 too.

Art9681 364 days ago

It has everything to do with the way you use it. And the biggest difference is how fast the model/service can process context. Everything is context. It's the difference between you iterating on an LLM boosted goal for an hour vs 5 minutes. If your workflow involves chatting with an LLM and manually passing chunks, and manually retrieving that response, and manually inserting it, and manually testing....

You get the picture. Sure, even last year's local LLM will do well in capable hands in that scenario.

Now try pushing over 100,000 tokens in a single call, every call, in an automated process. I'm talking the type of workflows where you push over a million tokens in a few minutes, over several steps.

That's where the moat, no, the chasm, between local setups and a public API lies.

No one who does serious work "chats" with an LLM. They trigger workflows where "agents" chew on a complex problem for several minutes.

That's where local models fold.
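
To put rough numbers on that, here is a small sketch of how long it takes just to ingest the context for such a workflow. The two prompt-processing rates are hypothetical placeholders (one slow, one fast), not measurements of any particular machine or API.

```python
def prefill_seconds(total_tokens: int, prefill_tokens_per_second: float) -> float:
    """Time spent only on reading the prompt tokens, ignoring generation."""
    return total_tokens / prefill_tokens_per_second


# A workflow that pushes a million tokens in total across its steps.
TOTAL_TOKENS = 1_000_000

# Hypothetical prompt-processing speeds in tokens per second.
rates = {"slow setup": 500.0, "fast setup": 20_000.0}

for name, rate in rates.items():
    minutes = prefill_seconds(TOTAL_TOKENS, rate) / 60
    print(f"{name}: {minutes:.1f} minutes of prompt processing")
```

With these made-up rates the same job costs about half an hour on the slow setup and under a minute on the fast one, which is the hour-versus-5-minutes kind of gap described above.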

refulgentis 364 days ago

You'll see good results, Kimi is basically a micro dosing Sonnet lol. V v v reliable tool calls, but, because it's micro dosing, you don't wanna use it for implementing OAuth, maybe adding comments or strict direction (i.e. a series of text mutations)

torginus 364 days ago

Not sure about benchmarks, but I did use Deepseek when it was novel and cool for a variety of tasks before going back to Claude, and in my experience it was OK, not significantly worse for what I use these models for (writing code a small function at a time, learning about libraries etc.) than the closed stuff at the time.

lossolo 364 days ago

While that's true for some open source models, I find DeepSeek R1 685B 0528 to be competitive with o3 in my production tests. I've been using it interchangeably for tasks I used to handle with Opus or o3.

jimbo808 364 days ago

I would have assumed anyone frequenting HN would have figured out by now that benchmarks are 100% bullshit. I guess I'd be wrong.

andrewmcwatters 364 days ago

I think anyone frequenting HN and actually using these tools absolutely knows these benchmarks are 100% bullshit and the only real way to test these things is to just use them yourself.

Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.

dist-epoch 364 days ago

So what do you propose? Gut feel, N=1 tests?

int_19h 364 days ago

At the moment, the only way you can tell if the model is good for a particular task is by trying it at that task. Gut feel is how you pick the models to test first, and that is also based largely on past experience and educated guesses as to what strengths translate between tasks.

You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.

sebzim4500 360 days ago

To some extent there must be a free lunch, because today's 30B models are enormously better than the 30B models that existed a year ago.

I suppose it's an open question whether there is another free lunch or whether the 30B models in a year will be not much better than our current ones.

spullara 364 days ago

right now that beats depending on the benchmarks

BoorishBears 364 days ago

I mean, in other environments people say that.

If you asked "What's the best bicycle", most enthusiasts would say the one you've tried that works for your use case, etc.

Benchmarks should be for pruning the models you try at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public set, generate a ton of synthetic examples, train on those, repeat).
