HN Mirror


by famouswaffles 83 days ago

ARC has always had that problem, but for this round the score is just too convoluted to be meaningful. I want to know how well the models can solve the problem. I may want to know how 'efficient' they are, but really I don't care as long as they're solving it in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy, convoluted score.

'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it, unlike the above two, but it's just incredibly silly to me to think we should be directly comparing something like that between entities operating in wildly different substrates.

If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that?
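To make the ambiguity concrete, here is a rough sketch. The scoring formula in it is an assumption for illustration only (a squared step-efficiency ratio with a 5x step cutoff), not necessarily the benchmark's actual formula, and the task counts and step numbers are made up. The point is just that two very different agents can land on the same blended number.

```python
# Hypothetical blended score: NOT the benchmark's confirmed formula.
# Assumption: each solved task earns (human_steps / model_steps) ** 2,
# capped at 1.0, and a task counts as failed past a 5x step cutoff.

def blended_score(results, cutoff=5.0):
    """results: list of (solved, model_steps, human_steps) tuples."""
    total = 0.0
    for solved, model_steps, human_steps in results:
        if not solved or model_steps > cutoff * human_steps:
            continue  # unsolved or over the step cutoff: zero credit
        total += min(1.0, (human_steps / model_steps) ** 2)
    return total / len(results)

# Agent A (made-up data): solves 1 of 20 tasks at human-level efficiency.
agent_a = [(True, 100, 100)] + [(False, 0, 100)] * 19

# Agent B (made-up data): solves all 20 tasks, using ~4.5x the human steps.
agent_b = [(True, 447, 100)] * 20

score_a = blended_score(agent_a)
score_b = blended_score(agent_b)

print(f"A: {score_a:.1%} with 1/20 solved")
print(f"B: {score_b:.1%} with 20/20 solved")
```

Under these assumptions both agents report roughly 5%, even though one solves almost nothing and the other solves everything.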

2 comments

pants2 83 days ago

The measurement metric is in-game steps. Unlimited reasoning between steps is fine.

This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.


famouswaffles 83 days ago

Same thing in this case. No utility and just as arbitrary. None of the issues with the score change.

Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.

Regardless, since there's a 5x step cutoff, 'brute forcing with millions of steps' was never on the table.


thereitgoes456 83 days ago

The metric is very similar to cost. It seems odd to justify one and not the other.


famouswaffles 83 days ago

Cost has utility in the real world and this doesn't. That's the only reason I would tolerate thinking about cost, and even then, I would never bundle it into the same score as the intelligence, because that's just silly.
