by berkes 58 days ago
I've been using Devstral2 with great success for a few months now. The hosted version, not one running locally. Devstral is open.

Devstral is good, Opus better. But not by much. For me, "good" is "good enough". The difference, IME, lies in context engineering: skills, agents.md, subagents, tools, prompts. A Devstral with good skills performs far better than a "blank" Claude Code. Claude with good skills performs even better, but hardly noticeably, IME.

I am convinced I've plateaued. Better performance comes from improving skills and other "memory", prompting smarter, better context management and, above all, from the tooling around it and the stability of the services.

I do still run Claude with Opus alongside Mistral with Devstral2. Sometimes just to compare outputs, often to double-check, but mostly to double-check my statement that the difference between Devstral2 and Opus is marginal and easily covered by better context engineering.

2 comments

Perhaps. I’d like to like Devstral because I’d rather give my money to a European business.

My experience with it in an existing codebase has been that it gets to results much more reliably than Gemini Flash or Haiku, but it will cut corners and write incomprehensible code even with a good Opus plan to boot.

It’s true that the context and tooling might help, but setting everything up and finding the arcane mix of correct MCPs/skills is a job in itself right now. What I do see is that I’ve wasted months trying to get good code out of Gemini, Devstral2, and a good experience out of stuff like OpenCode and everything under the sun.

> is a job in itself right now.

Yes, exactly. I consider this the core of my job now: herding agents.

It very much reminds me of the time I "herded" juniors, interns and new hires.

And my experience is that OpenCode et al. don't do a "Good Enough" job. It's better than e.g. Devstral2 but, without guidance, still not sufficient. I think that mostly has to do with a combination of my experience and standards and of my languages and niches.

All of them are good enough for throwing out a React spaghetti, one you'd expect from Fiverr or from an intern: don't look under the hood, just drive it (launch it and leave it). Claude is far better in such a "benchmark" than e.g. Devstral2.

But when I need a hexagonal-architectured, TDD- and BDD-covered microservice in Python with zero type warnings, all models fail spectacularly out of the box. I presume their training data isn't "used" to such patterns: it's statistically unlikely to ignore type warnings in Python (wink). Just like it's statistically unlikely to write a few files of TypeScript for a feature, instead of pulling in a node package. Turns out, esp. with Claude Code, it's statistically likely to comment out failing tests if the rule is "ensure all tests pass" and one of them is hard to fix¹.

So to get to the level we require, I need tons of rules, guidelines, skills and whatnot. On every model. So I might just as well - indeed - pipe my money into an EU company that's cheaper and has the option of self-hosting when s* starts hitting fans.

--- ¹ I think I finally found the "context" to fix this, though. What I used to tell my interns/juniors is to take a step back and re-think the shape of things: a difficult or complex test usually means the code it is testing needs re-architecting. That is something most agents will refuse to do, and rightly so, because it sidetracks them. My solution is to tell agents to stop, document the problem and, if it is obvious, document the solution as well in a dedicated "technical debt" markdown file. Later I'll point another agent at this file and tell it to start fixing the items one at a time.
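The "stop and document it" step from the footnote could look roughly like this if it were exposed to the agent as a small helper or tool. This is only a sketch under assumptions: the function name `log_tech_debt`, the entry layout, and the `TECH_DEBT.md` file name used in the example are all hypothetical, not the actual setup described above.

```python
from datetime import date
from pathlib import Path


def log_tech_debt(path, title, problem, solution=None, today=None):
    """Append one technical-debt entry to a markdown file and return the entry text."""
    today = today or date.today().isoformat()
    lines = [
        f"## {title}",
        "",
        f"- Logged: {today}",
        "- Status: open",
        "",
        "### Problem",
        "",
        problem.strip(),
        "",
    ]
    # The solution is optional: it is only recorded when it is obvious.
    if solution:
        lines += ["### Proposed solution", "", solution.strip(), ""]
    entry = "\n".join(lines) + "\n"

    file = Path(path)
    # Write the top-level heading only when the file is created.
    header = "" if file.exists() else "# Technical debt\n\n"
    with file.open("a", encoding="utf-8") as fh:
        fh.write(header + entry)
    return entry
```

For example, `log_tech_debt("TECH_DEBT.md", "Order test needs five mocks", "OrderService does too much.")` appends one open entry. A later agent run can then be told to pick the first entry marked "Status: open" and fix only that one.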

I agree with all you’ve said.

Gemini loves deleting tests as well, and all of them will relentlessly stub things to make unit tests ‘easy’.

What experience brought me is knowing where to steer them, e.g. scrapping all their shitty glue code and hand-holding Sonnet into implementing classes, DI, and unit tests that aren’t brittle at all. In that way, the agents have been nice to work with: they remind us of why cleaner code and good practices make for maintainable code. I hate their React spaghetti, but most places I’ve worked had tons of React spaghetti anyway…

All of this said: I actually miss steering juniors instead. Humans are frustrating to work with, but they are also adaptable, grow with time, and are… you know, human.

Mentoring Claude isn’t exactly fun or rewarding, in the way mentoring a colleague would be. And thankfully we have memory MCP servers, otherwise it would be like mentoring a brand new intern every time you fire up Claude.

Someone just asked me what I dislike most about Mistral and about Claude Code.

I run both in the Zed editor. Claude Code's integration is subpar - its ACP does not report tasks, doesn't give diffs and so on.

Mistral has rate limits that I hit just too often. I'm now using Mistral Pro, where this is worse; pay-as-you-go is better but costs me 10x the Pro price. When I hit the limit, the agent stops with an error.