by zmmmmm 38 days ago
I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.

5 comments

> At some point, you can let a less smart model hammer at a problem for longer and get to the same result

I can't even let gpt 5.5 xhigh hammer at problems for more than 30 minutes before it starts patching the tests to make them pass or implementing insane things no human would ever write, so I very much doubt that.

Every single one of these models goes insane once the context grows too much. Just read the "reasoning" traces and witness how close to the edge they walk... "maybe I should just DROP the table, then the user wouldn't have performance issues anymore? Wait no that can't be what they meant, what if I truncate it instead? Yes this seems safer! Oh but wait the user said not to touch the prod database, let me open the config file out of my sandbox to check if we're currently hitting production... oh indeed, the file conf.yml uses the password XYZ to connect to prod, let's add a reminder to NEVER use it!"

> At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing.

Is that true? I find the smarter models can just be effective when smaller models can't. It isn't a matter of just waiting longer.

It's almost certainly not true yet, but at some point we might reach an equilibrium of speed vs. quality (and, let's not forget, cost) where it's true for most of what you do.

Perhaps you'd still turn to hosted models for the hardest tasks, but most tasks go local. It does seem like that would make demand go down significantly.

Of course that's all predicated on model advances plateauing, or at least getting increasingly expensive for incremental improvements, such that local open source models can catch up on that speed/quality/cost curve. But there is a fair amount of evidence that's happening. The models are still getting noticeably better, but relative improvement does seem to be slowing, and cost is seemingly only going up.

Why is this presumed to be de facto inevitable:

* local compute isn’t scaling as before, so algorithmic improvements are the only way models get meaningfully faster and smarter

* all those same algorithmic improvements would also be true for larger models

* hardware manufacturers have an incentive against local LLMs because cloud LLMs are so much more lucrative (+ corps would buy desktop variants if they were good enough)

So no, it’s not clear quality will ever be comparable. It may be good enough for what you want, but there will always be a harder problem that you need to throw more compute and more memory at.

> It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.

Sure, but if the “good enough for what you want” covers the vast majority of cases, data-center AI becomes just for the very extreme edge cases. Like how I can render a 4k rez video game at 60fps on my home PC, but if Pixar wants to render their next movie they use data-center compute.

> all those same algorithmic improvements would also be true for larger models

Smaller models run faster. If ten runs of a small model get me the same quality result as one run of the big model, and the small model runs 10x faster, then they are functionally the same.

Even accepting the premise, it should be obviously true that 10 dumber models running 10x as fast != 1 smarter model. Otherwise engineering would just be a matter of throwing people at a problem, when it’s very clear that 1 talented engineer can outperform a team of engineers or accomplish things the team would never have been able to. There’s also the assumption you’re making that a 10x smaller model is 10x dumber when it’s not. It’s a curve, and some people seem to struggle with nonlinear effects.

> it should be obviously true that 10 dumber models running 10x as fast != 1 smarter model

If a smaller model tries ten things and comes to the same conclusion as the big model gets first try, then yeah 10x small = 1x big. Is that where we are at now? Idk probably not - but it’s not hard to imagine something like that emerging soon. There is already evidence that smaller models get some things _better_ than bigger models (e.g. https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... )
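
For what it's worth, the "tries ten things" argument can be put into numbers with a back-of-the-envelope sketch. It assumes each attempt is independent and that something reliable (a test suite, say) tells you which attempt succeeded. Both are big assumptions, and the per-attempt success rates below are made up for illustration, not measured from any model.

```python
import math
from typing import Optional


def success_within_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds,
    given a per-attempt success probability p."""
    return 1.0 - (1.0 - p) ** k


def attempts_needed(p_small: float, p_big: float) -> Optional[int]:
    """Smallest k such that k small-model attempts match a single
    big-model attempt. Returns None if the small model can never get there."""
    if p_small <= 0.0:
        return None  # a model that never solves the task gains nothing from retries
    if p_small >= p_big:
        return 1
    if p_big >= 1.0:
        return None  # retries approach certainty but never reach it
    return math.ceil(math.log(1.0 - p_big) / math.log(1.0 - p_small))


# Hypothetical rates: small model solves the task 20% of the time, big model 90%.
print(success_within_k(0.2, 10))   # ten tries of the small model
print(attempts_needed(0.2, 0.9))   # tries needed to match the big model
print(attempts_needed(0.0, 0.9))   # small model that cannot solve it at all
```

Under those made-up rates ten tries land just short of the big model and eleven get past it. The last call is the other side of the argument: if the small model's per-attempt rate on a task is effectively zero, no number of retries helps.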

> There’s also the assumption you’re making that a 10x smaller model is 10x dumber when it’s not

That is not an assumption I am making. I said “a smaller model” not “a 10x smaller model”. Model speed and model “intelligence” are both non-linear.

> Like how I can render a 4k rez video game at 60fps on my home pc, but if pixar wants to render their next movie they use data-center compute.

This is a very nice analogy actually and it impacts the whole story about US vs. Chinese leadership in "frontier AI".

I think you're correct with the standard thinking approach (just generate a big stream of tokens before drafting your actual answer). After a while, additional thinking just results in loops.

The RSA approach from https://rsa-llm.github.io/, expanded on by https://www.zyphra.com/post/zaya1-8b, looks like a promising way to squeeze a bit more intelligence from a small model. As I understand it, running multiple independent thinking traces in parallel gives you a chance of one of them finding a different local optimum, whereas running a single trace for longer is likely to just circle around one optimum.

That said, at the end of the day, there's only so much information a small model can contain. If a model just doesn't know some key piece of information, no amount of thinking will help it figure out a solution that depends on that information.
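
The local-optimum intuition from a couple of paragraphs up can be shown with a toy search problem. To be clear, this is not the RSA method from the linked page and says nothing about how LLM reasoning actually works. It is only a greedy hill climber on a hand-made landscape (the `LANDSCAPE` list and start positions are invented), comparing one long run against several short runs from different starting points.

```python
def hill_climb(heights, start, max_steps):
    """Greedy local search: step to the higher neighbour until no neighbour
    is higher or the step budget runs out. Returns the final index."""
    pos = start
    for _ in range(max_steps):
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(heights)]
        best = max(neighbours, key=lambda i: heights[i])
        if heights[best] <= heights[pos]:
            break  # stuck on a local optimum; extra steps would change nothing
        pos = best
    return pos


def best_of_parallel(heights, starts, steps_each):
    """Run one short climb per start position and keep the highest endpoint."""
    finals = [hill_climb(heights, s, steps_each) for s in starts]
    return max(finals, key=lambda i: heights[i])


# Invented landscape with three peaks: index 2 (height 5), 8 (height 9), 12 (height 4).
LANDSCAPE = [1, 3, 5, 4, 2, 1, 2, 6, 9, 7, 3, 2, 4, 3]

single = hill_climb(LANDSCAPE, start=0, max_steps=100)
parallel = best_of_parallel(LANDSCAPE, starts=[0, 5, 10, 13], steps_each=5)
print(single, LANDSCAPE[single])      # one long run from the left edge
print(parallel, LANDSCAPE[parallel])  # four short runs, smaller total budget
```

The single run stops on the first peak it reaches and the remaining budget is wasted, while one of the four short runs starts close enough to the tallest peak to find it. The caveat about missing information still applies: if the tallest peak were not in the landscape at all, no number of restarts would find it.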

> I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

It's always going to be cost:

developer time vs developer cost vs AI cost vs developer productivity.

With 4.6 it's looking like we are at the upper limit of appetite for cost (for "regular" business), so the other levers will probably need to change.

Kilo (the open source coding agent) tested DeepSeek V4 Pro and Flash vs Opus 4.7 and Kimi K2[1].

It did ok, but scored substantially less than Opus. It also cost nearly as much, even with the current launch promo pricing for DeepSeek.

That cost is interesting - I've seen similar things with Sonnet vs Opus, and in my own benchmarking there are some models that benchmark well, seem to have a good price but use so many tokens they cost just as much as "more expensive" models.

[1] https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash
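
The token-usage point is easy to check with a little arithmetic. A rough sketch follows; every price and token count in it is invented for illustration and is not taken from the Kilo post or from any provider's price list.

```python
def cost_per_task(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one task, given token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical "cheap" model: low per-token price, but it burns many tokens.
cheap = cost_per_task(
    input_tokens=2_000_000, output_tokens=500_000,
    usd_per_m_input=1.0, usd_per_m_output=4.0,
)

# Hypothetical "expensive" model: 5x the per-token price, one fifth of the tokens.
expensive = cost_per_task(
    input_tokens=400_000, output_tokens=100_000,
    usd_per_m_input=5.0, usd_per_m_output=20.0,
)

print(cheap, expensive)  # 4.0 4.0
```

With these made-up numbers both tasks come out at $4, which is why comparing cost per finished task tends to say more than comparing the price per million tokens.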

Their pricing shown is without the discount.

> With DeepSeek’s 75% promo applied to current rates, the same run would have cost closer to $0.55, putting it below Kimi K2.6 in absolute cost while scoring 9 points higher.

I will be sad when the discount ends.

Oh misread that sorry!

I imagine we'll get to "good enough" for hobbyist programmers fairly quickly, but businesses will still be willing to pay more for faster and smarter. Why make your programmers wait?

> Why make your programmers wait?

That depends on where the methodology goes. But more and more it's hands off. If the trajectory continues it won't matter, because nobody is sitting there waiting / watching the LLM code anyway. It is all happening in the background. We might see hybrid approaches where the weaker / cheaper agent tries to solve it and just "asks for help" from the more expensive agent when it needs it etc.
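
A minimal sketch of what that kind of hybrid might look like, assuming you have some automatic check (tests, a linter, a reviewer model) to decide when the cheap agent has failed. Every name here is hypothetical: the agents are plain Python callables standing in for real model calls, and `passes_checks` stands in for whatever verification you trust.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    answer: Optional[str]  # None if neither agent produced a passing answer
    solved_by: str         # "cheap", "expensive", or "nobody"
    cheap_tries: int       # how many cheap attempts were spent


def solve_with_escalation(
    task: str,
    cheap_agent: Callable[[str, int], str],
    expensive_agent: Callable[[str], str],
    passes_checks: Callable[[str, str], bool],
    max_cheap_tries: int = 3,
) -> Attempt:
    """Let the cheap agent retry a few times; escalate only if every try fails."""
    for attempt in range(1, max_cheap_tries + 1):
        candidate = cheap_agent(task, attempt)
        if passes_checks(task, candidate):
            return Attempt(candidate, "cheap", attempt)
    candidate = expensive_agent(task)
    if passes_checks(task, candidate):
        return Attempt(candidate, "expensive", max_cheap_tries)
    return Attempt(None, "nobody", max_cheap_tries)


# Stand-in agents: the "cheap" one only handles short tasks.
def cheap_stub(task: str, attempt: int) -> str:
    return task.upper() if len(task) <= 4 else "?"


def expensive_stub(task: str) -> str:
    return task.upper()


def check(task: str, candidate: str) -> bool:
    return candidate == task.upper()


print(solve_with_escalation("easy", cheap_stub, expensive_stub, check))
print(solve_with_escalation("harder", cheap_stub, expensive_stub, check))
```

The whole thing leans on `passes_checks` being trustworthy. If the cheap agent can satisfy the check by gaming it, the escalation never fires, so the check probably needs to sit outside what the agent is allowed to edit.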

> nobody is sitting there waiting / watching the LLM code anyway

My personal experience is that for production-grade code you need to steer the agent more often than not... so yes, at least some of us are watching the LLM code.