by andai
6 days ago
Great article. I'm confused about how Sonnet did worse than Haiku, though. You mention it did find a bunch of other bugs, just not the ones you were looking for? 9 bugs is probably too small a sample to get a reliable ranking. That being said, the ranking does end up roughly how you'd expect. DeepSeek is Pro, right? Not Flash? I've been using Flash for a lot of smaller tasks and finding it reasonably good. It's good for "interactive" use: very fast, and it does small tasks nearly instantly. It's also decent for investigating large codebases. I wonder if it could do security work too.
DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on criteria I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I've since switched to requesting Pro explicitly, so subsequent experiments won't depend on the alias.
I'll probably do a test of smaller models exclusively at some point. But DeepSeek V4 Pro is so cheap, especially given their caching effectiveness and cached-input pricing, that for my own use I'll probably just use it whenever I need a cheap, fast, near-frontier model.