| HN Mirror


by saberience 384 days ago

They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions.

What I've noticed when testing previous versions of Grok is that on paper they were better at benchmarks, but when I actually used them the responses were always worse than Sonnet and Gemini, even though Grok had higher benchmark scores.

Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.

3 comments

CamperBob2 383 days ago

> They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions

That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.


nwienert 383 days ago

It still seems possible to spend effort building up an ARC-style dataset, and that would game the test. The ARC questions I saw were not on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area in general, though, so I'd be curious if I'm wrong.


CamperBob2 383 days ago

ARC-AGI isn't question- or knowledge-based, though, but "Infer the pattern and apply it to a new example you haven't seen before." The problems are meant to be easy for humans but hard for ML models, like a next-level CAPTCHA.
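To make the "infer the pattern and apply it" format concrete, here is a toy sketch in Python. It is only an illustration: the grids, the `CANDIDATES` table, and `infer_rule` are all made up for this example, and real ARC-AGI tasks draw on a far larger and more open-ended space of transformations than three flips and a transpose. It is not how the benchmark is scored or how any model solves it.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # a small grid of color indices


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_horizontal(g: Grid) -> Grid:
    # Mirror each row left-to-right.
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    # Reverse the order of the rows.
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical, deliberately tiny hypothesis space.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(train_pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule consistent with every demo pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in train_pairs):
            return name
    return None


# A few demonstration pairs, then one unseen input.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]
test_input = [[7, 8], [9, 0]]

rule = infer_rule(train_pairs)
prediction = CANDIDATES[rule](test_input) if rule else None
print(rule, prediction)
```

The point of the real test is that the rule is not drawn from a short fixed list like this one, so a solver has to come up with the hypothesis itself from a handful of demonstrations.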

They have walked back the initial notion that success on the test requires, or demonstrates, the emergence of AGI. But the general idea remains, which is that no amount of pretraining on the publicly-available problems will help solve the specific problems in the (theoretically-undisclosed) test set unless the model is exhibiting genuine human-like intelligence.

Getting almost 16% on ARC-AGI-2 is pretty interesting. I wish somebody else had done it, though.


nwienert 383 days ago

I’ve seen some of the problems before, like https://o3-failed-arc-agi.vercel.app/

It is not hard to build datasets that have these types of problems in them, and I would expect LLMs to generalize well from them. I don’t see how this is really any different from any other type of problem LLMs are good at, given they have a dataset to study.

I get they keep the test updated with secret problems, but I don’t see how companies can’t game this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.
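As a rough sketch of what "building their own datasets" could look like in the simplest possible case, the Python below generates synthetic grid tasks from a fixed set of transformations. Everything here (`TRANSFORMS`, `random_grid`, `make_task`, the grid sizes) is a hypothetical illustration, not a description of what any lab does. Whether training on data like this transfers to the hidden test set is exactly the point in dispute in this thread.

```python
import random
from typing import Callable, Dict, List

Grid = List[List[int]]

# Hypothetical rule set; real ARC-style puzzles are far more varied.
TRANSFORMS: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def random_grid(rng: random.Random, max_side: int = 5, colors: int = 10) -> Grid:
    """Build a random grid with between 1 and max_side rows and columns."""
    height = rng.randint(1, max_side)
    width = rng.randint(1, max_side)
    return [[rng.randrange(colors) for _ in range(width)] for _ in range(height)]


def make_task(rng: random.Random, n_train: int = 3) -> dict:
    """Pick one rule, then emit demonstration pairs plus one held-out pair."""
    name = rng.choice(sorted(TRANSFORMS))
    fn = TRANSFORMS[name]
    train = []
    for _ in range(n_train):
        grid = random_grid(rng)
        train.append((grid, fn(grid)))
    test_in = random_grid(rng)
    return {"rule": name, "train": train, "test": (test_in, fn(test_in))}


rng = random.Random(0)  # seeded so the dataset is reproducible
dataset = [make_task(rng) for _ in range(100)]
print(len(dataset), "tasks generated")
```

A generator like this only covers the rules someone thought to write down, which is the counterargument: the hidden test problems are supposed to fall outside whatever list the dataset builders came up with.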


Tostino 383 days ago

The other question is whether enough examples of this type of task are helpful and generalizable in some way. If so, why wouldn't you integrate that dataset into your LLM training pipeline?


theshrike79 383 days ago

I use Grok with repomix to review my code. It tends to give decent answers and is a bit better at flagging actual, actionable issues with code examples than, say, Gemini 2.5 Pro.

But the lack of a CLI tool like codex, claude code or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.

With gemini I can just go `gemini -p "@repomix-output.xml review this code..."`


djmips 384 days ago

Well try it again and report back.
