HN Mirror

by SwellJoe 14 days ago

I've been benchmarking various models on finding hard security bugs, and my faith in Haiku (and even Sonnet) has dropped precipitously in the process. Self-hosted Qwen 3.6 27B consistently outperforms both at finding security bugs, which was a shocking result. I expected Qwen to be around Haiku level, maybe a little worse, and I definitely expected it to be worse than Sonnet.

And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.

There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).

4 comments

gwerbin 14 days ago

I don't think that's what these small models are for. They are for things like text summarization and generating a title for your AI session. Maybe Haiku occupies a weird zone where it's overpowered for those tasks but underpowered for anything more sophisticated. But, for example, I used it on an agentic reasoning task recently (reading a chunk of information and drawing a written conclusion, not writing code) and it did just fine. A more powerful model would have been a waste of money.

SwellJoe 14 days ago

Sure, but it's priced higher than many better models. I'm not saying use the biggest models for everything. I'm saying Haiku is not a great deal as small models go. You can even self-host a model that is competitive if you've got a pretty beefy machine.

Haiku costs $1/$5 (input/output). DeepSeek V4 Flash, a stronger model, is only $0.0028/$0.14/$0.28 (cached input/input/output), and DeepSeek caching is crazy efficient. So, using DeepSeek V4 Flash costs about an order of magnitude less than Haiku and performs better.

I have a Claude subscription because I'm willing to pay a premium for the best model for coding, one that doesn't waste as much of my time doing dumb stuff. But, if I need something other than Claude Code, I'm using something other than Claude models. Why burn money for no benefit?

Oh, also, Haiku chews tokens like crazy. In my benchmarks it used three times more tokens than the next highest model. Of course, security bug hunting is not in its wheelhouse, so it's not fair to judge it based on that one thing, but if it's more expensive per token and burns a lot more tokens, it ends up being a lot more expensive.
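
The two effects in this comment (price per token and tokens consumed) multiply, which a rough sketch makes concrete. The prices are the ones quoted above, assumed to be USD per million tokens; the workload size, the 80% cache hit rate, and the `Pricing` and `run_cost` names are hypothetical, and the 3x multiplier is borrowed from the benchmark remark purely for illustration (it was measured against "the next highest model", not necessarily DeepSeek).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per million tokens (assumed unit for the quoted figures)."""
    input_price: float
    output_price: float
    cached_input_price: Optional[float] = None  # None: no cache price quoted


def run_cost(pricing, input_tokens, output_tokens,
             cache_hit_rate=0.0, token_multiplier=1.0):
    """Estimate the USD cost of one run.

    token_multiplier scales input and output tokens alike, a simplification
    standing in for a model that needs more tokens for the same task.
    """
    inp = input_tokens * token_multiplier
    out = output_tokens * token_multiplier
    cached_price = (pricing.input_price if pricing.cached_input_price is None
                    else pricing.cached_input_price)
    cached = inp * cache_hit_rate   # tokens billed at the cached rate
    fresh = inp - cached            # tokens billed at the full input rate
    total = (fresh * pricing.input_price
             + cached * cached_price
             + out * pricing.output_price)
    return total / 1_000_000


# Prices as quoted in the comment above.
HAIKU = Pricing(input_price=1.00, output_price=5.00)
DEEPSEEK_V4_FLASH = Pricing(input_price=0.14, output_price=0.28,
                            cached_input_price=0.0028)

# Hypothetical workload: 1M input tokens and 200k output tokens.
INPUT_TOKENS = 1_000_000
OUTPUT_TOKENS = 200_000

haiku_same_tokens = run_cost(HAIKU, INPUT_TOKENS, OUTPUT_TOKENS)
haiku_3x_tokens = run_cost(HAIKU, INPUT_TOKENS, OUTPUT_TOKENS,
                           token_multiplier=3.0)
deepseek_no_cache = run_cost(DEEPSEEK_V4_FLASH, INPUT_TOKENS, OUTPUT_TOKENS)
deepseek_cached = run_cost(DEEPSEEK_V4_FLASH, INPUT_TOKENS, OUTPUT_TOKENS,
                           cache_hit_rate=0.8)  # hypothetical hit rate

for label, cost in [
    ("Haiku, same token count", haiku_same_tokens),
    ("Haiku, 3x tokens", haiku_3x_tokens),
    ("DeepSeek V4 Flash, no cache hits", deepseek_no_cache),
    ("DeepSeek V4 Flash, 80% cache hits", deepseek_cached),
]:
    print(f"{label}: ${cost:.4f}")
```

On this made-up workload the gap is roughly 10x at equal token counts and roughly 30x once the 3x token multiplier is applied; a different input/output mix would shift both figures.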

hadlock 14 days ago

I suspect the outrageous pricing of Haiku/Sonnet is offsetting the cost of Opus. The value proposition a year ago was that they were cheaper than Opus, not that they were a fantastic value (which they're not).

not_kurt_godel 14 days ago

Haiku/Flash/small models are underpowered for literally anything where being correct on details, without false positives, matters at least something like 25%. (That's not to say they are only correct 25% of the time; it's definitely more than that. But they're blatantly, confidently wrong often enough that the wasted time is a significant net negative for me, even on relatively trivial tasks.)

SyneRyder 13 days ago

I don't suppose you've had a chance to benchmark MiniMax V3 yet? I've only just started testing other models after being an Anthropic fan. I haven't put MiniMax V3 to coding tasks yet, but something about my early simple tests has impressed me. The MiniMax API pricing is about 7% of Anthropic API prices (about matching Anthropic's subscription pricing).

SwellJoe 13 days ago

I haven't, but I'll probably add it to the benchmark soon.

canpan 13 days ago

Same opinion. Opus is best for coding, but Qwen 3.6 27B Q8 is next, ahead of Sonnet.

Sonnet might have more knowledge and is maybe good for making Excel sheets, but it does not write good code and does not follow instructions well.

But 27B Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option for many people, and DS4F is so cheap right now if you are open to externally hosted models.
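
A back-of-the-envelope sketch of where a figure like 48GB can come from, assuming weights dominate and Q8 means about 8 bits per weight. The `weight_memory_gb` helper is hypothetical, real quantization formats store a little extra metadata per block, and the headroom line only shows what is left over for KV cache, activations, and runtime overhead; it does not estimate how much those need.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 27B parameters at a nominal 8 bits per weight (assumption for "Q8").
q8_weights_gb = weight_memory_gb(27, 8)

# Same model at a nominal 4 bits per weight, for comparison.
q4_weights_gb = weight_memory_gb(27, 4)

# What remains of the 48 GB mentioned above once the Q8 weights are loaded.
headroom_gb = 48 - q8_weights_gb

print(f"Q8 weights: ~{q8_weights_gb:.1f} GB")
print(f"Q4 weights: ~{q4_weights_gb:.1f} GB")
print(f"Headroom on a 48 GB setup at Q8: ~{headroom_gb:.1f} GB")
```

Under these assumptions the weights alone take about 27 GB, which already exceeds a single 24 GB card before any context is loaded.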

egeozcan 14 days ago

DeepSeek competes with Sonnet, not significantly worse or better. It tends to do weird things in codebases on the bigger side.

SwellJoe 14 days ago

At $3/$15, Sonnet is more than an order of magnitude more expensive than DeepSeek at $0.435/$0.87 (cached input is $0.003625, and DeepSeek is very good at caching, so it's very cheap to use). So, if they're equal in performance, DeepSeek is ten times better value.
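
How large the gap is depends on the mix of input and output tokens, which a small sketch can show. It uses only the prices quoted above, assumed to be USD per million tokens. The function names and the `output_share` parameter are hypothetical, and because no cached price for Sonnet is quoted here, Sonnet is charged the full input rate, which overstates the gap if Sonnet caching is in play.

```python
def blended_price(input_price, output_price, output_share,
                  cached_input_price=None, cache_hit_rate=0.0):
    """Average price per million tokens for a given token mix.

    output_share is the fraction of all tokens that are output tokens.
    """
    if cached_input_price is None:
        cached_input_price = input_price  # no discount when none is quoted
    effective_input = ((1 - cache_hit_rate) * input_price
                       + cache_hit_rate * cached_input_price)
    return (1 - output_share) * effective_input + output_share * output_price


def sonnet_to_deepseek_ratio(output_share, deepseek_cache_hit_rate=0.0):
    """How many times more Sonnet costs than DeepSeek at the quoted prices."""
    sonnet = blended_price(3.00, 15.00, output_share)
    deepseek = blended_price(0.435, 0.87, output_share,
                             cached_input_price=0.003625,
                             cache_hit_rate=deepseek_cache_hit_rate)
    return sonnet / deepseek


input_only = sonnet_to_deepseek_ratio(0.0)    # all tokens are input
one_fifth_out = sonnet_to_deepseek_ratio(0.2)  # 20% of tokens are output
output_only = sonnet_to_deepseek_ratio(1.0)   # all tokens are output

print(f"input only:  {input_only:.1f}x")
print(f"20% output:  {one_fifth_out:.1f}x")
print(f"output only: {output_only:.1f}x")
```

With no cache hits the ratio runs from about 6.9x for pure input to about 17.2x for pure output, crossing 10x at roughly an 18% output share; any DeepSeek cache hits push it higher.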

But, from what I can tell, DeepSeek is better than Sonnet, though I agree it is not at the level of current Opus or GPT 5.5 (but I think it probably beats Gemini Pro 3.1). I use the best model I can for code, because the cost of weaker performance is more than the $100/month I pay for Claude Opus, but it's worth knowing there are very cheap, very good models for stuff I want to do that isn't Claude Code.

egeozcan 13 days ago

I think there are so many variables, from harnesses to tasks, that it is very hard to put the models in a pecking order unless one beats another in virtually every task (as with Opus vs DeepSeek).

But all in all, I don't think we disagree.
