	by Alifatisk 150 days ago
	Have you all noticed that the latest releases from Chinese companies (Qwen3 Max Thinking, now Kimi K2.5) are benchmarking against Claude Opus now and not Sonnet? They are truly catching up, almost at the same pace?

4 comments

conception 150 days ago

https://clocks.brianmoore.com

K2 is one of the only models to nail the clock face test as well. It’s a great model.

culi 150 days ago

Kimi K2 is remarkably, consistently the best. I wonder if it's somehow been trained specifically on tasks like these. It seems too consistent to be a coincidence.

Also shocking is that the most common runner-up I've seen is DeepSeek.

michaelcampbell 149 days ago

It's better than most, but not 100%. As I see it, the clock hands are all correct, but the numbers only go 1-8.

DJBunnies 150 days ago

Cool comparison, but none of them get both the face and the time correct when I look at it.

conception 150 days ago

Refresh. It’s not every time, but K2 hits a perfect clock for me about 7/10 times or so.

WarmWash 150 days ago

They distill the major western models, so anytime a new SOTA model drops, you can expect the Chinese labs to update their models within a few months.

zozbot234 150 days ago

This is just a conspiracy theory/urban legend. How do you "distill" a proprietary model with no access to the original weights? Just doing the equivalent of training on chat/API logs has terrible effectiveness (you're trying to drink from a giant firehose through a tiny straw) and gives you no underlying improvements.

Alifatisk 150 days ago

Yes, they do distill. But just saying all they do is distill is not correct and actually kind of unfair. These Chinese labs have done lots of research in this field and published it openly, and some, if not the majority, contribute open-weight models, making a future of local LLMs possible! DeepSeek, Moonshot, MiniMax, Z.ai, Alibaba (Qwen).

They are not just leeching here; they took this innovation, refined it, and improved it further. This is what the Chinese are good at.

Balinares 150 days ago

Source?

esafak 150 days ago

They are, in benchmarks. In practice Anthropic's models are ahead of where their benchmarks suggest.

HNisCIS 150 days ago

Bear in mind that the lead may come, in large part, from the tooling rather than the model.

zozbot234 150 days ago

The benching is sus, it's way more important to look at real usage scenarios.
