| HN Mirror



	by joshrw 6 hours ago
	Chinese models optimize for benchmarks and do poorly in real-world tasks

1 comments

epolanski 5 hours ago

Not my experience at all. I have written multiple posts comparing DS4 vs Opus 4.8 on 16 real-life work tasks.

Also, every single lab does RL on benchmarks, which is why Opus 4.6 was the last truly great assistant. After it, all models tend to drift into implementation ASAP.


jameswhitford 4 hours ago

Hi, author here. Can you link them? I would love to read about this.


epolanski 4 hours ago

https://news.ycombinator.com/item?id=48584034
