by mohsen1 89 days ago

They evaluate on the public dataset, which is not meant for evaluation. Then they write a super-specific prompt[1] and claim eye-catching results.

This is the state of "AI" these days I guess...

[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

2 comments

versteegen 89 days ago

The dataset miscomparison is a big problem. The prompt is super specific to ARC-AGI-3, which is perfectly fine to do, but skimming it I saw nothing that appears specific to the 25 games in the dataset. Especially considering they've only had one day for overfitting. Could be quite subtle leakage though.

Rebuff5007 89 days ago

Of course it is... we are in an era where a well-timed blog post showing "SOTA results" on a benchmark can net millions in funding.
