HN Mirror

by sho 32 days ago

I am nowhere near as concerned by this as I was a year ago, when I was expecting the axe to fall at any moment before the Chinese labs achieved some sort of escape velocity. I now think it's too late: all the cats are out of all the bags, there's no moat except maybe a temporal one of a few months, the genie is out of the bottle.

There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough. Deepseek 4 and Kimi 2.5 are not quite Claude 4.5/GPT5.5 but there's no fundamental principle missing - they are strong evidence that there's no real advantage the "frontier" labs possess that isn't related to scale, which they will gain in time (if they even need to). The RL post-training techniques that work are widely known and easily copied. All Deepseek is really lacking is data, which they're getting - and the harder Anthropic/the USG makes it to access Claude in China, the more of that precious data they'll get!

I used to sort of entertain the "fast take-off breakaway" scenario as being plausible but not really anymore. The only genuine moat the frontier labs have is their product take-up, which isn't nothing, far from it, but it's not some unbreakable technological wall. Too late guys - it might have been too late for quite some time.

10 comments

gpt5 32 days ago

I wish it was true. I would gladly use a GPT 5.2 high model equivalent for coding (6 months old) if it was offered cheaper by Deepseek or Kimi. And I'm sure that's an extremely prevalent opinion by the millions of Claude and Codex users who are bothered by the costs.

However, they just don't perform that well in practice. That's the real issue. You can actually see it when you move away from open benchmarks. Deep seek 3.2 is 4% on Arc-AGI 2 [1], while GPT 5.2 high is 52% and GPT 5.5 pro high is 84.6%. That's the real reason why nobody is using these models for serious work. It's incredibly frustrating.

In addition, I already feel the pain of the model restrictions myself. I'll ask my codex 5.5 agent to crawl a website - BOOM, cybersecurity warning on my account. I'll ask it to fix SSH on my local network - another warning. I'm worried about the day my account gets randomly banned and I cannot create a new one. OpenAI already asks you to perform full identification in order to eliminate these warnings - probably exactly for that - so that if they ban you, it's permanent.

[1] https://arcprize.org/leaderboard

usernametaken29 32 days ago

I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark release with stock price increase. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives. In terms of whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is diminishingly small.

versteegen 31 days ago

I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about it using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.

gpt5 32 days ago

ARC-AGI isn't perfect, but it helps demonstrate the gap. I'm sure all companies optimize their models for this benchmark given its dominance.

snemvalts 31 days ago

What about other benchmarks? Benchmarks where the contents are freely available have become useless for evaluating models.

energy123 32 days ago

Why do you think DeepSeek isn't also fine tuned on ARC AGI? Maybe they're more fine tuned on ARC AGI but still get worse scores. There's no way to know.

usernametaken29 32 days ago

My gut feeling is that ARC doesn’t play as big of a role in the Chinese model manufacturer landscape. It’s one byproduct but China is focusing on resource efficiency (for political reasons and low compute). So unlike OpenAI, poor performance on ARC doesn’t hurt as much if the model works well. OpenAI literally hinges on hype so the insane economic bets they make somehow pay off. If you have billions and the future of the company on the line, you ace the exam any way you can. We noticed early on that whenever some dataset of ARC was released, GPT would suddenly do well on the classes of problems in that dataset. But it just doesn’t generalise. They fine tune like crazy. I bet they fine tune for raspberry counting at this point. Again, for OpenAI the perception of moat is everything! Keep that in mind.

zozbot234 32 days ago

True, ARC is mostly an artificial "human-like AGI" benchmark that doesn't really reflect any plausible workload. Very different from things like Humanity's Last Exam that reflect real-world knowledge and are now getting closer and closer to saturation even with open models.

applfanboysbgon 32 days ago

> Deep seek 3.2 is 4% on Arc-AGI 2

Why are you bringing up an outdated Chinese model from 6 months ago to compare to a US model from 6 months ago? The outdated Chinese model will have performance from ~12 months ago, obviously. But today's Chinese model DeepSeek 4 has performance not far from the US model 6 months ago; 46% compared to 52% from 5.2.

gpt5 32 days ago

Because Deepseek 4.0 is not yet there, but the jump isn't expected to be large. Kimi 2.5 is there and is also scoring low.

DCKing 32 days ago

Deepseek V4 came out three weeks ago: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

Kimi K2.5 has also been superseded by a finer tuned Kimi K2.6 three weeks ago. Moonshot's Kimi models appear to be the favored Chinese model, at least for coding, and not Deepseek V4. z.AI's GLM 5.1 is also worth mentioning as rather competent for coding, also released in April.

Those models too will not be beating US AI labs by your metrics (although for coding, Kimi K2.6 might beat the very uneven Gemini depending on the situation), but in your criticism at least consider the state of the art in your comparisons.

noisy_boy 31 days ago

I have been using Deepseek v4 pro for personal projects and home infra related work for the last couple of weeks. Its quality of work is not bad at all, it is fairly fast, and given that it costs a fraction of what Claude does, I can keep going, which makes it a very compelling option. Looking forward to trying out Kimi 2.6, thanks for the recommendation.

KronisLV 31 days ago

Also they have a pretty big token discount running this month: https://api-docs.deepseek.com/quick_start/pricing/

Even without the discount, I'll have to think about whether I need the 100 EUR tier of Anthropic Max, or whether downgrading to Pro and using DeepSeek is good enough. And they're also up on OpenRouter and other places.

Been using those models, not quite comparable with Opus 4.6/4.7 but with max reasoning, pretty good for a variety of dev tasks! Only big problem is no ability to process images, so can't really do browser use for some semi-automated testing, I'd have to write Playwright tests even when I don't want to.

DCKing 31 days ago

I've been using OpenCode Go ($10/month) for personal projects (I have Claude subscription for $DAYJOB) and for the tinkering around that I do for myself the quality of the open weight models and the limits of the OpenCode plan are sufficient. I agree that for a lot of dev tasks they're quite good!

stavros 31 days ago

I've been using Deepseek 4 Pro (instead of Sonnet 4.6) as the developer LLM (Opus is the planner) and it's been great. Not super fast, with all the reasoning, but has been writing good code, and I think I paid $5 so far (whereas with Sonnet I'd have run out of the weekly limits on Max for weeks now).

Definitely recommended, though it's crucial that you have GPT 5.5 review the code afterwards.

pjerem 32 days ago

Hum, I've been using it [0] with my Ollama Cloud subscription for the last two weeks and I love it. Never reached the 5-hour usage limit of the $20 plan (on side projects), whereas I would sometimes reach it in ONE prompt with Opus.

[0]: https://ollama.com/library/deepseek-v4-pro

sho 32 days ago

I 100% agree with you, but I've been convinced over the last year that it's a time and scale issue, not anything fundamental.

The Chinese models right now are in a weird spot. Compared to the frontiers, both their pre and post training is woeful - tiny, resource constrained in every dimension including human, slow. I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!

But they "cheat" quite a lot in distillation and very benchmark-focussed RL and that's where you get this superficial quality in the leaderboards that doesn't match up when you go off-script. Arc is a great example in that it really betrays an "inferior soul" at the heart of it all.

What gives me great hope though is that those same scaling laws that Altman and others have been hyping forever will absolutely kick in for the Chinese labs just as they did for the US ones, and I don't think anything can stop that process now. So they will catch up. It won't be tomorrow, but it's not going to be 10 years either. 3-5 would be my reasonably educated guess.

And the final risk, that China itself might try to restrict availability of the tsunami of GPU or other AI hardware it will inevitably produce - well, I just can't really imagine a country that has been configuring itself for the last 40 years as a single purpose export machine deciding that actually, no, it doesn't want to export something.

About the model restrictions - absolutely. I've been trying to do security research on my own software and the frontier models immediately get suspicious. I've been playing with the local ones much more this year basically because of this. They have deficiencies, for sure - they feel very "hollow" compared to the major labs. But I've talked to a lot of people, and the consensus is pretty clear - just a matter of time.

flir 32 days ago

Just an observation: constraints often result in creative solutions. I wouldn't be surprised if a smaller lab makes a big breakthrough because they have to.

joefourier 31 days ago

> I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!

Say what? 5 years ago OpenAI had received around $139 million in funding, and they’d just come out with GPT3 with 175B parameters, a 2048 context window, trained on 300B tokens on a 10,000 V100 cluster which would have cost maybe $4-13 million at the time for their training run.

Meanwhile Deepseek V3’s famously frugal training was $5M, and Chinese AI companies are raising billions in funding. Sure, American AI companies are raising tens (and maybe hundreds in the case of OpenAI, if you count their circular funding rounds) of billions but they’re grossly inefficient, and we’ve already hit the limits of the scaling laws where there’s little point in increasing the number of parameters of a model.

Our_Benefactors 31 days ago

> Meanwhile Deepseek V3’s famously frugal training was $5M

And widely derided once the team was unable to provide receipts. It’s more likely to be 10x

gmerc 31 days ago

Why make up things? The papers are published completely and compare apples to apples: a 5M final training run against the grok 3.5 (400M) final training run.

Our_Benefactors 31 days ago

Oh, it was written in a paper, must be correct then, no further investigation required, just believe it at face value! No track record of academic dishonesty, and definitely no incentives to fudge the numbers.

ageitgey 32 days ago

Have you tried the latest DeepSeek v4 Pro inside of the Claude Code harness? It's not listed in that site.

It definitely 'feels like' it is as good as Claude for many regular web app coding tasks (though I don't have real benchmarks). And it is comically cheap.

I'm not suggesting it is better than the latest Claude or codex models, but it seems 'good enough' for a lot of use cases in my limited real world testing.

PAndreew 32 days ago

I'm starting to feel like a parrot, but people seem to forget that software engineering is actually a very narrow slice of the white collar pie. You don't need a mega-model which can reason about 100 000 lines of code when you want to create a nice PPT (which consumed literally hours of your life before) to impress your boss. SOTA models will probably be used for frontier research, complex coding tasks, large scale data analysis, etc. And the average Joe shall be able to buy a pre-configured box with a plug-and-play harness and run medium models air-gapped. Or use such models through cloud APIs dirt cheap if privacy is not a concern.

ageitgey 32 days ago

On the same topic but from a slightly different angle - as SOTA models get more capable, the 'quality' and 'feel' of the experience they provide in each domain is heavily dependent on the reinforcement learning the vendor does for that specific domain. After all, many fields have 100 flavors of "good answers," but the model has to pick one answer.

Benchmarks are not very good at capturing this yet. But it could be the case that DeepSeek v4 Pro is 100% as good as Claude Opus 4.7 at scaffolding a basic Rails app, but absolutely terrible at creating a credible business plan that another businessperson would think is real. That's a made-up example, but you get the point.

The end result will be a lot of people arguing about which model is "better," but "better" depends heavily on the task and how that model was trained to interact with the user for that task. Two users may have very different qualitative experiences using the exact same model, despite the benchmarks.

zozbot234 32 days ago

Creating a nice PPT is actually hard because it requires visual capabilities and so-called "computer use" (really, GUI use) of fiddly proprietary software. The nice thing about the coding case compared to a lot of disparate white-collar work is that it's all plain ASCII text. You can already ask a coding model to create a nice TeX/beamer slideshow (or whatever the Typst-based equivalent is) but whether your boss will be duly impressed by that is anyone's guess.

m_mueller 31 days ago

Tangential, but in our opinion corporate PPTX automation is an unsolved problem, even with Claude for PowerPoint (and it's worse with everything else common out there). Its harness (a) is not tuned very well for corporate use and (b) even if it were, fails to manage the specific business knowledge within each org needed to create effective (i.e. audience tailored) presentations.

I've just written a blog post about this topic this week: https://octigen.com/blog/posts/2026-05-11-ai-presentation-ga...

nimonian 32 days ago

This is a tangent but I'd also mention sli.dev -- slideshow-as-website is really great and fun to make with llms

omnimus 32 days ago

Also so many developers I know use LLMs for one-shotting isolated problems, explainers, discussions and planning. For these even Kimi is pretty great.

I don't think every dev will be comfortable just releasing claude on their project.

energy123 32 days ago

They're not even that much cheaper (1/2 price per task according to Artificial Analysis) once you account for lower token usage of GPT-5.5. I can't justify it when factoring in the extra time wasted, and the cheap codex usage I get through the monthly plan. Frontier intelligence is not a commodity product ... yet.

gruez 31 days ago

The price per task already factors in token usage so you're double counting if you're also tacking "higher token usage" as another argument on top
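To make the double-counting point concrete, here is a minimal sketch of how a per-task price already folds in token usage. All prices and token counts below are hypothetical placeholders, not figures from Artificial Analysis or any vendor; they are chosen only so that the result lands on the "1/2 price per task" ratio mentioned above.

```python
def cost_per_task(price_per_million_tokens: float, tokens_per_task: float) -> float:
    """Cost of one task = per-token price times tokens consumed by that task."""
    return price_per_million_tokens * tokens_per_task / 1_000_000


# Hypothetical numbers, for illustration only.
# The frontier model is pricier per token but uses fewer tokens per task.
frontier = cost_per_task(price_per_million_tokens=30.0, tokens_per_task=100_000)
# The cheaper model costs 6x less per token but burns 3x the tokens.
cheap = cost_per_task(price_per_million_tokens=5.0, tokens_per_task=300_000)

# The ratio already reflects the extra tokens; there is nothing left to add on top.
ratio = cheap / frontier
print(f"frontier: ${frontier:.2f}, cheap: ${cheap:.2f}, ratio: {ratio:.2f}")
```

Under these made-up inputs a 6x per-token discount shrinks to a 2x per-task discount, which is the whole effect of higher token usage. Citing token usage again as a separate argument would count it twice.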

irthomasthomas 31 days ago

Arc has no predictive power whatsoever. I always use the best models available. So far I haven't found a task that Chinese models cannot solve very quickly and reasonably. Do you have any examples where they failed for you?

csomar 31 days ago

If you want something close to Claude, use glm 5.1 with Claude Code. Their subscription price is no longer 10x cheaper though (at best 2x cheaper).

otabdeveloper4 32 days ago

And yet Claude six months ago was amazing and good enough for you.

This shows that AI cloud consumption is just a conspicuous consumption status symbol, nobody knows why they need cloud AI or what problem they are even solving.

doctorwho42 31 days ago

Ah, AI is running off of the highway model, induced demand. That kind of makes a lot of sense now that I think about it.

TrackerFF 31 days ago

Which is why, I believe, the big AI companies are starting to focus and roll out vertical products more. They know that the models themselves aren't sticky, people can easily switch between different models with not much hassle.

I think the big AI companies are trying to transform into the next Microsoft. Completely capture both enterprise and consumers.

reeredfdfdf 31 days ago

"I think the big AI companies are trying to transform into the next Microsoft. Completely capture both enterprise and consumers."

That is going to be a failing strategy though. Whatever OpenAI or Anthropic implement, Microsoft and Google can trivially copy and provide to their existing customers that are already deeply invested in their platforms.

scotty79 32 days ago

> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough.

Over the last year it seems that the only thing US labs are ahead in is money spent. At least half of the technical innovations, if not more, came from Chinese labs and were published openly.

nradov 31 days ago

Broad and deep capital markets are a real competitive moat for the USA. No other country or economic bloc can deploy huge amounts of capital to new opportunities nearly as fast. China can work around that to an extent with a command economy that focuses resources on national strategic priorities but it's slower and less effective over the long term.

wiekke 31 days ago

Actually this will end up being the greatest disadvantage.

Pure spending power doesn’t give you the edge in tech. Creativity, and innovating under constraints leads to success.

nradov 31 days ago

Nah. If that was an actual disadvantage then the USA wouldn't already be the world leader in most technology sectors. Capital is only one of several constraints.

yorwba 32 days ago

All of the reasons in the article also apply to Chinese companies. If a Chinese model becomes good enough to make it significantly easier to hack Chinese government servers, do you think they'll allow random people unfettered access to it?

The economic pressures are the same, too. Currently, Chinese models are offered for cheap or in some cases provide weights for free because that's the only way to gain traction. (That closed-weight releases by Baidu, Bytedance, iFlyTek etc. hardly generate any buzz bears that out, as does the fact that when Alibaba does a closed-weight release, someone always gets confused because they associate the Qwen brand with open models.) At some point, their investors are going to want profits, not just user counts. That means higher prices, or no more new models.

If there's no secret sauce and all you need is scale, that would actually be kind of the worst-case scenario for catching up to the frontier, since scaling is expensive and the frontier model companies have easier access to capital as well as higher revenues.

zozbot234 31 days ago

> If a Chinese model becomes good enough to make it significantly easier to hack Chinese government servers, do you think they'll allow random people unfettered access to it?

They aren't trying to become that good, nor do they need to in order to have real positive impact. Models like Mythos are estimated to be humongous even on a datacenter-wide scale, which is actually a big factor in its limited availability at present. It's mostly helpful as a one-of-a-kind proof of concept, to answer the question of whether AI can still plausibly scale by growing capabilities and what happens to alignment concerns when you do that.

yorwba 31 days ago

I expect every company to try to make a model as good as they possibly can, especially now that Mythos has served as a proof of concept to demonstrate that there's lots of interest in AI for cybersecurity. But if they don't try, that hardly assuages concerns about not being able to access the very best models, does it?

hbarka 32 days ago

Harness engineering is a moat. There’s user loyalty and reliance on the chassis that Claude is on, for example, just like there’s more market share by MacOS+WindowsOS over Linux Open Source.

kasey_junk 32 days ago

I regularly switch between codex and Claude in the same sessions. I’d throw in other models if I could.

Data governance and enterprise sales is a moat. The harnesses aren’t.

PunchyHamster 32 days ago

The tooling industry has been moving very much in the direction of "plug in the AI of your choosing" for a while now, and given how much Anthropic fights the 3rd party tools, they are definitely afraid of being left in the dust.

> just like there’s more market share by MacOS+WindowsOS over Linux Open Source.

It's hard to change OS. It's not hard to jump from one AI tool to another

saberience 31 days ago

It's absolutely NOT a moat. Making a harness is the EASY part.

If you had said "marketing is a moat" then yes, I would say you were right. But creating a harness equal to or better than Claude Code is trivial. The CC harness is actually shit. There are tons of open-source harnesses that work better than CC while using Opus via OpenRouter.

ElFitz 32 days ago

I thought so too.

But 1) people use other models with that same harness. 2) I moved on from Claude Code and had all the features I cared about up and running in less than a couple of days, without even looking for available plugins or extensions.

thepasch 32 days ago

> Harness engineering is a moat.

I mean, if that’s the case, then Anthropic themselves are currently actively filling in that moat with nice, solid, walkable dirt. Claude Code may have been a moat 6 months ago but these days you’ll want to replace the “m” with a “bl”.

BrtByte 32 days ago

I agree the genie is out of the bottle technologically. I'm less convinced that means access stops being politically and economically important. The bottle may be gone but the best lamps are still expensive

trollbridge 32 days ago

But a “good enough” lamp just got a lot cheaper. The cost of tokens on DeepSeek V4 Pro is so low I don’t even think about it, and I am currently trying to figure out useful things for as many simultaneously running agents as I can. What would have cost $150 less than a year ago now costs 35¢.

Likewise Qwen 3.6 absolutely blows me away and that’s on a 35b 6-bit model on a local 5090. Same thing, busy trying to find stuff to do to keep it busy 24/7.

I can still find some niches for Opus 4.7 but being able to attack problems and not worry about consumption is a game changer.

jorvi 32 days ago

Virtually no one is going to pay for the best performing lamp if the next best lamp does 90% as good for an order of magnitude cheaper.

I will say, as pointed out by others, DeepSeek and other Chinese providers still lack a bit in the tooling that Claude has, but they'll get there.

Paradigma11 32 days ago

That presumes that there is a linear scale that measures performance. This can be tested: https://en.wikipedia.org/wiki/Rasch_model

Even assuming this holds, the utility you gain from the best models depends completely on your workload. If you have tasks that require performance 10 and DeepSeek has 9, you will gladly pay for SotA models.
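For reference, the Rasch model linked above puts ability and difficulty on one shared scale and gives the success probability as a logistic function of their difference. A minimal sketch follows; treating the "performance 10" and "9" figures as Rasch logit values is an assumption made purely for illustration, and the function name is hypothetical.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch model: P(success) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values only: a task of difficulty 10 attempted by models
# whose abilities are assumed to sit at 10 and 9 on the same scale.
p_matched = rasch_probability(ability=10.0, difficulty=10.0)
p_one_below = rasch_probability(ability=9.0, difficulty=10.0)
# The same one-point gap on an easy task barely matters.
p_easy_task = rasch_probability(ability=9.0, difficulty=5.0)

print(f"matched: {p_matched:.3f}")
print(f"one point below: {p_one_below:.3f}")
print(f"easy task: {p_easy_task:.3f}")
```

Under that assumption, a one-point gap roughly halves the success rate on tasks at the edge of the stronger model's ability but is nearly invisible on easy tasks, which is why the value of the gap depends on the workload.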

baq 32 days ago

And yet it seems that 90% are happily paying for the marginal 10% capability and saturate datacenters.

lmm 32 days ago

Happy to pay for? Or happy to spend other people's money on?

baq 31 days ago

somebody is happy to spend that money

lugu 32 days ago

That is called marketing.

baq 31 days ago

not necessarily. it might just as well be 'time is money'.

BrtByte 32 days ago

If the second-best lamp is 90% as good and 10x cheaper, most people will use the second-best lamp...

avazhi 32 days ago

That’s what he said?

moffkalast 31 days ago

I would agree, the only thing Kimi is really missing is stability and harness training. For general chat tasks I consider it mostly on par. Occasionally I'll give the same problem to Kimi, Claude, GPT, Gemini and it's not unusual to see Kimi correctly figure out some kind of weird extra thing that the others missed, like some kind of mentally unstable savant.

shevy-java 32 days ago

> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough

This is not just about mainland China though. The current US government is extremely selfish and self-centered. Other countries really need to consider their own long-term situation here.

ElFitz 32 days ago

> The only genuine moat the frontier labs have is their product take-up

And even then, there is no stickiness. For most use cases there isn’t much value in one frontier model over the other.

Just have to look at the people flocking from one to the other for whatever reason.

baq 32 days ago

I’ve been flocking from GPT to opus every week for the past 3 months and I always come back.

The point isn’t that gpt is better, it’s that it is so much better for my work it isn’t even sticky, it’s reinforced concrete. I use opus 1% of the time because it writes better and it’s sticky there.

Yes I’ll switch approximately immediately if opus or Gemini (which I use more than opus!) is better for what I do, but at this point frontier model tokens are not fungible.

ElFitz 32 days ago

There will always be dataset and training quirks, and the provider’s own biases and focus, granting one model an edge over the others in some specific domain.

baq 32 days ago

Yup and that’s where the moats are.

dotancohen 32 days ago

The large AI houses arguably ensure that model switching be a natural action for their clients, by switching the default model of their flagship offerings every few months. Such is the price of progress.

nojs 32 days ago

What about access to GPUs and memory? This is becoming a pretty major bottleneck.

repelsteeltje 32 days ago

Today's tech echoes the 1960-1970 mainframe era: very centralized around a handful of companies controlling "massive cloud compute" in bespoke mainframe-like topology.

All of that will be legacy in a couple of years. Today's B200 clusters are tomorrow's e-waste. Decentralization might happen gradually or abruptly. But to me it's obvious that we'll be thinking of high-tech tensor processors and GPUs the way we thought of individual transistors and tube amplifiers in the 1980s.

If AI turns out to be the revolution it purports to be, then the underlying hardware will change much more rapidly than it did with ICs and microprocessors in the late 1970s. Today's hot is tomorrow's junk.

zozbot234 31 days ago

> Today's B200 clusters are tomorrow's e-waste.

Hardware depreciation timescales are actually getting longer, not shorter, because frontier hardware like B200 clusters is highly bottlenecked. It's not just a RAMpocalypse out there, we're seeing early signs of production bottlenecks with GPUs and maybe even CPUs.

doctorwho42 31 days ago

Which, in itself, is a major crack that AI has caused in the delicate foundation of our technological society.

aurareturn 31 days ago

One thing that is potentially different this time is that Moore's Law has stopped scaling. Computers aren't getting smaller exponentially. They're getting bigger with multiple chips glued together to make up for Moore's Law.

repelsteeltje 31 days ago

...But there's a new world dawning for photonic chips.

No reason to expect Moore's observation to apply there (though, maybe?), but it will have big implications for power usage.

aurareturn 31 days ago

Photonic chips allow computers to get bigger, not smaller.

wokkel 32 days ago

It's basically converted sand. Most of that conversion happens in Taiwan at the moment, which China considers one of its provinces and the USA treats as a protectorate. Hence the interest in that region....

asdff 32 days ago

Everyone is expecting them to invade Taiwan, but why not merely extort Taiwan?

littleparrot 32 days ago

You mean that by contributing to the RAMpocalypse, the mainland incentivizes the West to build its own fabs, making Taiwan expendable for us someday?

asdff 31 days ago

The West has been incentivized to build its own fabs for years but still fumbles that effort. All the billions spent hardening the South China Sea and Taiwan's chip manufacturing against a future Chinese invasion would probably have paid for a lot of manufacturing capacity stateside.

zozbot234 31 days ago

Mainland China is growing its own RAM manufacturing capacity. It is too small to make a real dent in the RAMpocalypse yet, but this could change.