| HN Mirror



	by andybak 45 days ago
	> draw a pretty good pelican on a bike.

	You mean the famously hard task? The one picked because it stretches frontier models to their limits?

3 comments

munk-a 45 days ago

It was a famously hard task. It was an ingenious idea for an unexpected task that falls outside of the bounds of predictable normal input but is still readily comprehended by the public.

Unfortunately, as soon as it's a famously hard task, trainers know they need to succeed at it, and it loses a lot of its power to detect correctness.


quantummagic 45 days ago

In fairness, that isn't due to a lack of compute.


daveguy 45 days ago

https://simonwillison.net/2026/Apr/22/qwen36-27b/

Maybe this is an example of training overfit. But it won't be too long before local models chew through the "famously hard tasks". Except possibly ARC-AGI. That's one benchmark that is still developing along with capabilities. And every time a new ARC-AGI benchmark is released, it makes the SOTA LLMs look pathetic, because there is very little understanding or transferability with LLMs. But in terms of benchmark-able micro tasks, the local LLMs are improving.
