HN Mirror

by nl 50 days ago

A word of caution on this.

I've tried this too, and was disappointed.

Kimi generally benchmarks at "a bit more intelligent than Sonnet Medium" levels[1] and I'd agree broadly with this assessment.

If you have adapted your coding to rely on the agentic style that is doable in Opus 4.7+ then you will find Kimi disappointing.

If you are using it in a more targeted way then it can work well.

[1] https://artificialanalysis.ai/agents/coding-agents?agents=cl...

2 comments

kouteiheika 50 days ago

Yes, I would agree with this.

I think it works best when you're using the agent in a more hands-on way with a targeted prompt. If you're obsessive about code quality like I am (so you thoroughly review and, when needed, reprompt or even rewrite what the agent does) then you'll be fine, but if you like to just throw a prompt at the wall and expect it to plan and execute the whole thing perfectly then you'll be disappointed.

A middle-ground trick one can use is to have Opus (or Fable now) plan the whole thing and get something cheaper like Kimi execute on it.
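A minimal sketch of that plan-then-execute split, assuming each model is just a function from prompt text to response text. The names `plan_then_execute`, `planner`, and `executor` and the prompt wording are hypothetical, not any vendor's API:

```python
from typing import Callable, List

# Assumption: a "model" is any callable that maps a prompt to a response.
Model = Callable[[str], str]


def plan_then_execute(task: str, planner: Model, executor: Model) -> List[str]:
    """Ask the expensive model for a step list, then run each step on the cheap one."""
    plan = planner(f"Break this task into numbered steps, one per line:\n{task}")
    # Keep non-empty lines only; each remaining line is treated as one step.
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [
        executor(f"Task: {task}\nCarry out this step only:\n{step}")
        for step in steps
    ]
```

In practice the planner would wrap the stronger model and the executor the cheaper one, and you would still review each result by hand.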

rented_mule 50 days ago

CodeWhale (formerly deepseek-tui) automates this over DeepSeek V4 Flash and Pro. My shallow understanding is that it prompts the model to evaluate the complexity of a given task, then decides on Flash vs. Pro at various reasoning levels for that task. This can help with both cost and speed. If other agent platforms don't already do this, I have to imagine they will at some point.
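A rough sketch of that kind of routing, assuming the model has already rated the task on a 1-10 complexity scale. The tier labels, thresholds, and the `Route` and `choose_route` names are made up for illustration and are not CodeWhale's actual logic:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str      # illustrative labels: "flash" or "pro"
    reasoning: str  # illustrative levels: "low", "medium", "high"


def choose_route(complexity: int) -> Route:
    """Map a self-assessed complexity score (1-10) to a model and reasoning level.

    The thresholds are arbitrary assumptions; a real tool would tune them.
    """
    if not 1 <= complexity <= 10:
        raise ValueError("complexity must be between 1 and 10")
    if complexity <= 3:
        return Route("flash", "low")
    if complexity <= 6:
        return Route("flash", "high")
    if complexity <= 8:
        return Route("pro", "medium")
    return Route("pro", "high")
```

Cheap tasks stay on the fast model, and only the hardest ones pay for the larger model at full reasoning effort.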

I'm retired and can't justify spending too much on these things. CodeWhale over DeepSeek is helping me understand this space much better (and have some fun!), and it's quite affordable. I've spent ~30 hours using it over the last couple of weeks, and I've spent $3.89 on DeepSeek in that time. If I don't feel like writing any code for a few weeks, I pay nothing. Looking at DeepSeek's dashboard, about 60% of my requests have gone to Pro and 40% to Flash. I've used 97M Pro tokens and 19M Flash tokens (well over 90% of each have been cache hits, so the price is much lower than it would otherwise be).
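For scale, a quick calculation using only the figures quoted above. The 30 hours is approximate, so the hourly number is too, and the per-token number is a blended average across Pro and Flash with cache hits included:

```python
def cost_per_million_tokens(total_usd: float, pro_millions: float, flash_millions: float) -> float:
    """Blended cost per million tokens across both models."""
    return total_usd / (pro_millions + flash_millions)


def cost_per_hour(total_usd: float, hours: float) -> float:
    """Average spend per hour of use."""
    return total_usd / hours


blended = cost_per_million_tokens(3.89, 97, 19)  # 116M tokens in total
hourly = cost_per_hour(3.89, 30)

print(f"{blended:.4f} USD per million tokens")  # 0.0335
print(f"{hourly:.2f} USD per hour")             # 0.13
```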

emodendroket 49 days ago

Cursor's Auto mode is built on this premise, though with my limited experience I can't say how effectively it categorizes tasks.

selicos 50 days ago

This is in the direction of Mixture of Experts (MoE) setups. A trained 'router' sits on top of different expert models, routes work to the best or most efficient model for each task, and integrates the results into a whole for the user.

At least, that is what I get from the MoE style. Small and fast experts with a router LLM on top to best use them, then the harness to keep it all together.

nl 49 days ago

> Small and fast experts with a router LLM on top to best use them

A router LLM isn't a MoE.

A MoE is a type of LLM architecture, not lots of different LLMs. They are fundamentally different concepts and it is a fundamental misunderstanding to conflate the two.
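To make the distinction concrete, here is a toy sketch of a single MoE layer in pure Python. The experts are sub-networks inside one model, and a small learned gate picks the top-k experts for each token vector. The function names, the top-k softmax weighting, and the shapes are simplifying assumptions; real implementations differ in detail:

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(
    token: Vector,
    gate_weights: Sequence[Sequence[float]],  # one weight row per expert
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Apply one MoE layer to one token vector."""
    # The gate scores every expert for this token (a dot product per expert).
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])
    # The output is the weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        expert_out = experts[i](token)
        out = [o + w * v for o, v in zip(out, expert_out)]
    return out
```

The routing here happens per token inside one network's forward pass, which is a different thing from a harness choosing between separate LLMs per task.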

poly2it 49 days ago

Is there any open model that can emulate the agentic experience you get with Opus 4.7?

rstuart4133 49 days ago

GLM 5.1 gets close to Opus 4.6. It can happily run for hours and achieve a result. I've given it bugs like a race condition that led to a count being off by 1 after millions of operations, somewhere in a hundred thousand lines of C code littered with locks and atomic swaps, and it found it (as did Opus). Most other models can't.

I'm using Fable now and GLM 5.1 doesn't really compare. But GLM is literally 1/20 the price. I can't use Fable for coding - it's too expensive. So now we have three levels of models: lightweight ones you dispatch en masse to find things; ones capable of agentic coding tasks that can run for hours, like Opus and GLM (and possibly other open source ones - I've only tried a few); and now Fable, which is a truly helpful "architecture buddy". Fable still makes many, many mistakes, so you have to review every word it writes.

nl 49 days ago

Not yet that I've tried, and I'm pretty systematic about test driving them.

I keep https://sql-benchmark.nicklothian.com/#all-data up-to-date with latest releases and try out most that score 24+.

GPT 5.5+ or Opus 4.6+ are the only things I find useful like this. Notably Gemini isn't useful in this way.
