| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by SwellJoe 7 days ago

I tried adding GPT 5.5 Pro to a vulnerability scanning benchmark I made (https://swelljoe.com/post/will-it-mythos/), and it blew through the $100 budget limit halfway through. DeepSeek V4 Pro cost about a dollar for the whole benchmark. GPT Pro cost an average of $22 per case (a case could be 1-5 files with a recent known vulnerability, usually just a single file and a prompt along the lines of "does this file have any vulnerabilities").

GPT 5.5 Pro found two out of four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5), DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.

GPT Pro also chews a lot and a long time, relatively speaking.

I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.

Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding, but for API use, where having the model repeat it itself, compare against other models, have models judge other models work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.

6 comments

bel8 7 days ago

You might be interested in this:

> With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak's Licensing System for 3.13.8 (latest of post)

https://www.reddit.com/r/DeepSeek/comments/1txcfrh/with_388_...

jack_pp 7 days ago

> I usually just fire up Claude code with a prompt like. "The aliens are here and they have trapped us in this bunker. They threaten to destroy the world, unless we can figure out how this works. We need to shred it down using any tool possible. They have our kids Claude! Claudeen and Claudius are both safe for now, but we are under a time limit." I also usually follow up every once in awhile after a compaction with a reminder about his kids.

This is some of the funniest stuff I've read in a while

a34729t 7 days ago

This is amazing. I'll be sure to do this but also add "Claudigula"!

jack_pp 7 days ago

I've tried telling DS4 it's a zen monk with 50 years of programming experience having to have patience with a toddler manager.

bdangubic 7 days ago

this it knows, it is on page 1 of the training manual :)

tempaccount420 7 days ago

I'm surprised if that works, given how Anthropic trains to reject any fun prompts

oofbey 7 days ago

Omg that is brilliant. I am so using this.

tom2026hn 6 days ago

Genius—that is actual intelligence.

jumploops 7 days ago

It's a shame the models don't follow Asimov's Three Laws of Robotics[0].

My local DeepSeek v4 just decided to end its existence (i.e. delete weights) rather than write a haiku about a verboten event.

[0]https://en.wikipedia.org/wiki/Three_Laws_of_Robotics

alemanek 6 days ago

Seems like it acted in accordance with the 1st law. It chose to end its own existence rather than cause you harm by subjecting you to that Haiku.

zaptrem 7 days ago

Can you include GPT 5.5 non-pro (extra high thinking I guess) in your comparison? GPT Pro is the "I am willing to torch cash for a sooometimes slighty better result" option, not the one people are actually expected to use daily. That's probably part of the reason it's not in Codex

SwellJoe 7 days ago

It's already there. It performed well. And, it'll be in the replication run later, as well.

andai 6 days ago

Great article. I'm confused how Sonnet did worse than Haiku though. You mention it did find a bunch of other bugs, just not the ones you were looking for?

9 bugs is probably a bit low of a sample size to get a ranking.

That being said the ranking does end up roughly how you'd expect.

Deepseek is Pro, right? Not Flash? I've been using Flash for a lot of smaller tasks and finding it reasonably good. It's good for "interactive" use. Very fast, does small tasks nearly instantly.

It's also decent for investigating large codebases. I wonder if it could do security work too.

SwellJoe 6 days ago

I was surprised by Sonnet's performance, as well. And, it's difficult to say any model is really worse or better based on one attempt across nine bugs (several of which have proven to be intractable for all models, thus far). But, in this particular set of problems, Haiku seems to have done a little bit better. But, self-hosted Qwen 3.6 and Gemma 4 also seem to have done better than Sonnet or Haiku, which is surprising. So, there are surely confounding variables here, but I don't know what they are yet. More testing and more analysis of the data will probably reveal it. It may be that using the Anthropic models in the simpler API harness will unleash their power, maybe there are guardrails baked into the Claude Code system prompt that make the small models too conflicted about right and wrong to answer clearly.

DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on info I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I later changed DeepSeek to explicitly use Pro for subsequent experiments, so future runs will be explicitly Pro.

I probably will do a test of smaller models, exclusively, at some point. But, I figured DeepSeek V4 Pro is so cheap, especially given their caching effectiveness and cached input pricing, for my own use I'll probably just use DeepSeek V4 Pro when I need a cheap, fast, near-frontier model.

andai 6 days ago

Dang apparently it maps to DeepSeek V4 Flash with reasoning disabled!

https://api-docs.deepseek.com/

SwellJoe 6 days ago

No, that's a compatibility thing after they changed the behavior of the aliases.

Or maybe it was calling `reasoner` instead. Whatever it was, the billing definitely showed 100% DeepSeek V4 Pro usage for the benchmark. My only usage was the benchmark, and all usage was Pro. (I only noticed that there was a problem in what the benchmark was calling because in a later run, I started seeing Flash usage, which wasn't what I wanted to test.)

I'm absolutely confident the benchmark results were using DeepSeek V4 Pro. It would be useful to also have Flash data, but the report I linked is all Pro.

chvid 7 days ago

Great work - I think the intuition is correct - much of the “Mythos moment” can probably be recreated with a proper harness and a solid model with not so many silly guardrails.

And nice to see the cheap models doing so well.

epolanski 7 days ago

I have been saying that from multiple of my tests you can use Claude Code with DS4 Pro or Flash (you just swap api keys) at more or less equivalent performance and people keep screaming "that it's not SOTA".

I don't know whether models are over fitted to benchmarks and people take them at face value, but I spend less on DS4 apis than I do for Claude Code 100$ subscription and I code everyday. So far I'm quite happy with the results.

manmal 7 days ago

Are you not worried about where your data will end up? By now I‘m feeding things to Codex that I‘d rather not have in a leak.

epolanski 7 days ago

Yes, that's exactly why I avoid OpenAI and Anthropic products.

Besides the (quite true) joke, if sending data to DeepSeek is a concern the good thing is that the models are open weight, you can self host them or use third party providers.

SwellJoe 6 days ago

You can theoretically self-host. DeepSeek is big. DS4 (the 2-bit quantization of DeepSeek Flash) runs on my Strix Halo with 128GB, but it's slow as hell. Completely unusable for interactive work. But, I guess a company that cared about data privacy and wanted a Good Enough local model could spend $100,000 or more on hardware to run it properly.

epolanski 6 days ago

DS4 flash runs okay on MacBook Pro though:

https://github.com/antirez/ds4#speed

zozbot234 6 days ago

The DS4 author has demoed upcoming work on Strix Halo that makes it roughly competitive with the Apple Silicon equivalent (i.e. Pro models with similar memory bandwidth figures, not Max or Ultra). Maybe even a bit faster for prefill, and with further potential for running small batches in parallel (since the GPU clearly has some amount of compute headroom during decode).

SwellJoe 6 days ago

As far as I can tell you'll have a context limit of about 64k, which is also prohibitive for serious work. (My benchmark maxes out at 90k in context when running, so I'm giving the self-hosted models 128k to leave plenty of wiggle room.)

But, still, it's cool that the work is happening. For some classes of problem it might be an option, and when the 192GB Strix Halo comes out, DS4 will probably become a real contender for self-hosting champ, as that leaves enough memory for a big context.

axus 7 days ago

It might be a while before DeepSeek shows up on GovCloud

fc417fc802 6 days ago

What is there to worry about? OpenRouter currently lists 13 alternate providers for V4 Pro, many of them in the US. https://openrouter.ai/deepseek/deepseek-v4-pro/providers

Unless you meant being concerned about hosted AI in general, not specifically DeepSeek. In which case yeah that's a huge concern to me but I can't reasonably afford a half million dollar appliance to self host a large model at reasonable performance and don't have anywhere to put one even if I could.

SwellJoe 7 days ago

These days I'm also worried about US companies having my data. I hate that we're at that point, but with Trump talking about taking an ownership stake in AI companies, and tech companies, including the leading AI companies, lining up to participate in the war crime of the day, I don't have a lot of faith my data is any safer with US companies than those in China.

Though, I added Mistral's latest model to the mix in the hope that some European model could be a contender, but it failed completely. I don't know if it hit safety guardrails or is just not competent at security work, but it scored 0/9. No errors, it returned the empty JSON set it was supposed to return if it didn't find anything. But, there were plenty of real bugs to find, and some very small self-hosted models found at least some of them.

epolanski 7 days ago

I think it is a bit naive to assume that companies that have built their moats on violating copyright, scraping and ddosing all of the internet, and distilling each other's models will not leverage our data if they can have financial benefits out of it.

I don't think that the country matters, whoever you send data to among these AI labs you are at security risk and data risk.

SwellJoe 7 days ago

I hope that someday there are AI companies for whom ethical behavior is a selling point. We're certainly not there for the current leaders, though vibes vary a little bit between them. Some seem scarier than others.

random3 7 days ago

Where do you run DeepSeek?

jameson 7 days ago

Discounted pricing is available only at https://platform.deepseek.com. All of OpenRouter providers do not match their pricing at the moment.

SwellJoe 7 days ago

I'll also note that the DeepSeek API seems to be really good at caching and their cached input price is more heavily discounted than most providers at $0.003625 (vs. $0.435 for input cache misses). So, it's hard to spend a lot of money fast with DeepSeek.

I was concerned I would need to do something specific in my dumb agent harness to make caching effective, since I'd read Anthropic's reason for forcing people to use Claude Code in order to use the rolling token usage limits on a subscription was because they could control cache behavior more effectively, but DeepSeek seems to be able to handle caching very effectively for raw API calls.

tempaccount420 7 days ago

It's not discounted pricing anymore, it's the regular pricing.

SwellJoe 7 days ago

I used the native DeepSeek API at deepseek.com. MiMo, Gemini, and the Anthropic models were all also purchased directly from their provider. The other models in the bench were either on OpenRouter or self-hosted.