| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by usernametaken29 40 days ago
	I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark release with stock price increase. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives. In terms of whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is diminishingly small.

3 comments

versteegen 39 days ago

I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about it using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.

link

gpt5 40 days ago

ARC-AGI isn't perfect, but it helps demonstrates the gap. I'm sure all companies optimize their models for this benchmark given its dominance.

link

snemvalts 39 days ago

What about other benchmarks? Benchmarks where the contents are freely available have become useless for evaluating models.

link

energy123 40 days ago

Why do you think DeepSeek isn't also fine tuned on ARC AGI? Maybe they're more fine tuned on ARC AGI but still get worse scores. There's no way to know.

link

usernametaken29 40 days ago

My gut feeling is that ARC doesn’t play as big of a role in the Chinese model manufacturer landscape. It’s one byproduct but China is focusing on resource efficiency (for political reasons and low compute). So unlike OpenAI, poor performance on ARC doesn’t hurt as much if the model works well. OpenAI literally hinges on hype so the insane economic bets they make somehow pay off. If you have billions and the future of the company on the line, you ace the exam any way you can. We noticed this early on that whenever some dataset of ARC was released suddenly the classes of problems in that dataset GPT would do well on. But it just doesn’t generalise. They fine tune like crazy. I bet they fine tune for raspberry counting at this point. Again, for OpenAI the perception of moat is everything! Keep that in mind

link

zozbot234 40 days ago

True, ARC is mostly an artificial "human-like AGI" benchmark that doesn't really reflect any plausible workload. Very different from things like Humanity's Last Exam that reflect real-world knowledge and are now getting closer and closer to saturation even with open models.

link