by shawntan
249 days ago
I think everyone should carefully read the post from the ARC-AGI organisers about HRM (https://arcprize.org/blog/hrm-analysis). With the same data augmentation and 'test time training' setting, vanilla Transformers do pretty well, coming close to the "breakthrough" results HRM reported. From a brief skim, this paper uses similar settings for its own ARC-AGI comparison. I, too, want to believe in smaller models with excellent reasoning performance. But first understand what ARC-AGI tests for, what the general setting is (the one commercial LLMs use to compare against each other), and what the specialised setting is that HRM and this paper use for evaluation. The naming of that benchmark lends itself to hype, as we've seen with both HRM and this paper.
|
I think ARC-AGI was supposed to be a challenge for any model, the assumption being that you'd need the reasoning abilities of large language models to solve it. It turns out that this assumption is somewhat wrong. Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly are you hinting at?