Hacker News
by 0xbadcafebee 58 days ago
Mistral models are definitely good enough. Most people fall for what I call the SOTA Logical Fallacy: whenever there is a 'better model', they think they need to use it, when less-powerful models actually perform the same tasks just as well. (it's an inverse form of the Shifting Baseline Syndrome: every time a new model comes out, people shift their baseline of what is acceptable, despite the fact that a previous baseline was acceptable for the same task)

Devstral Small 2 was (and remains) a particularly strong small coding model, even beating larger open-weight models. Mistral's "problem" is marketing: other providers ship model updates constantly, so they remain in the news and seem like they're "beating" the competition. And it works: people get emotionally attached to brands and models, deciding who's better in the court of popular opinion, and that drives their choices (& dollars).

4 comments

My biggest issue with Devstral, and even with their biggest model, is that they’re dangerous unless closely directed and reviewed, and I mean CLOSELY. Unfortunately, Mistral models will believe and do anything.

See: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

Look at some of the test results; it’s horrifying.

FWIW personally i prefer this. When i tried Qwen3.6 and asked it a few questions, it did respond, but it was ADAMANT i should do something else when i really wanted an answer to the question i asked. It felt like when you search something and a stackoverflow answer about what you search for comes up and the most upvoted answer is about using/doing something else - when you want a specific answer to that specific question, not something else.

Meanwhile Devstral Small 2 just answers the damn question.

I don't want to have to convince my computer to do what i want it to do, i want it to do what i ask.

> It felt like when you search something and a stackoverflow answer about what you search for comes up and the most upvoted answer is about using/doing something else - when you want a specific answer to that specific question, not something else.

Don't you think there's usually a good reason for this? Whenever this happened to me, the problem was my ignorance.

I think there is a reason why people do that: they are trying to steer those they consider newbies away from patterns they consider bad. At the same time, this second-guessing can be annoying when you know what you want to do (especially when the original question isn't actually answered, yet it comes up in search engine results...).

I can't say if it is a good reason in general, perhaps it is, but it certainly is something i personally find annoying. I think answers should provide an answer to the question asked and then, after that answer was given, they could also give pointers for whatever they consider a better approach and why - this is important, IMO, for a public forum where people of all backgrounds and goals can read the same stuff.

But either way, LLMs IMO should do/provide what they are asked without trying to second guess the user (or at least, there should be LLMs that act like that).

That’s my experience as well: if Opus pushes back, it’s usually an actual issue with the code or prompt.

FWIW i haven't used Claude or any other cloud-based LLM, only what i can run on my PC. It could be that Claude is smart enough to follow the user's instructions, keep the equivalent of a mental state of what the user seems to want to do, and only push back when it really makes sense, whereas a small local LLM is too stupid to judge all that: Qwen3.6 errs on the side of being annoyingly cautious, while Devstral Small 2 errs on the side of trusting that the user is really okay with blowing their toes off :-P. As i wrote in my original reply, this is my personal preference and i prefer the LLM to just do what i ask.

TBH sometimes i feel like i'm "emotionally attached" to Mistral's models because i always end up using them :-P. However that is because, as you wrote, their small models (i only use local stuff) are very strong. In fact i was trying Qwen3.6 27B recently, and while it is nice that it can do tool calls during the reasoning process (i had it confirm its thoughts by writing Python code), it often ended up confusing itself (regardless of tool calls) during reasoning, ending up in loops where it questions itself over and over endlessly.

Devstral Small 2, however, just works, for the most part. Qwen3.6 27B can probably handle more complex tasks: when i asked it as a test to write a function that checks for collision between two AABBs in C and gave it a tool to call Python code for confirmation, it actually wrote a Python script that writes C code with the tests, then calls GCC to compile the C code and runs the binary to run the tests, which is something Mistral's small models couldn't do. But i always felt i can just leave DS2 doing stuff in the background (or while i'm doing something else) and it'll produce something relatively useful, whereas in the little time i spent with Qwen3.6 27B it felt more "unstable" (and much slower, both because of literally slower inference and because of endless reams of text).
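
For context, the AABB collision test mentioned above comes down to a handful of comparisons. A minimal sketch in Python (the task in the comment was in C), assuming 2D boxes stored as min/max corners and treating touching edges as a collision. The `AABB` class and the `aabb_overlap` name are hypothetical, and this is not the code either model produced:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AABB:
    """Axis-aligned bounding box in 2D, stored as min/max corners (assumed layout)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def aabb_overlap(a: AABB, b: AABB) -> bool:
    """Return True if the two boxes overlap; touching edges count as overlap."""
    # The boxes collide only if their intervals overlap on every axis.
    # If they are separated on any single axis, there is no collision.
    return (
        a.min_x <= b.max_x and b.min_x <= a.max_x
        and a.min_y <= b.max_y and b.min_y <= a.max_y
    )
```

Whether touching boxes should count as colliding depends on the use case; switching `<=` to `<` makes the test strict.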

Recently i also started using Ministral 3B and 14B - these can do some reasoning too and have some vision abilities (though they're quite mediocre at vision so i haven't found much use for this - passing something through GLM-OCR to extract all text and feeding it to another model feels more practical). For very simple stuff Ministral 3B is very fast (i actually didn't expect a 3B model to be anything more than a novelty).

Also, as i wrote in another comment, no Mistral model i've tried has ever questioned me, which i certainly prefer.

> Most people fall for what I call the SOTA Logical Fallacy: whenever there ...

I think you'll find that ML now pretty much IS the HPC market, there's no distinction anymore. And the HPC market has always had the "being #1 gets you 99% of all business" dynamic, even if #2 is only 10% behind SOTA.

Given what it's used for (i.e. military applications, incl. nuclear weapons, but also rocket design, flight planning, large-scale simulations), this is probably justifiable: part of it is states keeping in mind what the second prize in a war is worth ...

For certain tasks that are not hard but depend on a clear specification, it's even better to have a less capable model, because it forces you to write a better description of what you want, and you end up with a better result. I will defend my PhD thesis soon, and I will buy a yearly Mistral subscription at the student price to get it cheap.