by mohsen1 89 days ago
Uses a public dataset for evaluation that is not meant for evaluation. Writes a super-specific prompt[1] and claims eye-catching results.

This is the state of "AI" these days I guess...

[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

2 comments

The dataset mismatch in the comparison is a big problem. The prompt is super specific to ARC-AGI-3, which is perfectly fine to do, but skimming it I saw nothing that appears specific to the 25 games in the dataset, especially considering they've only had one day to overfit. There could still be quite subtle leakage, though.
Of course it is... we are in an era where a well-timed blog post showing "SOTA results" on a benchmark can net millions in funding.