HN Mirror


by tripletao 542 days ago

I feel like many people are reacting to the string "AGI" in the benchmark name, and not to the actual result. The tasks in question are to color squares in a grid, maintaining the geometric pattern of the examples.

Unlike most other benchmarks where LLMs have shown large advances (in law, medicine, etc.), this benchmark isn't directly related to any practically useful task. Rather, the benchmark is notable because it's particularly easy for untrained humans but particularly hard for LLMs, though that difficulty is perhaps not surprising, since LLMs are trained mostly on text and these tasks are geometric. An ensemble of non-LLM solutions already outperformed the average Mechanical Turk worker. This result is a big improvement over the best previous LLM solution, but this might also be the first time an LLM has been tuned specifically for these tasks, so it might be an instance of Goodhart's Law.

It's a significant result, but I don't get the mania. It feels like Altman has expertly transformed general societal anxiety into specific anxiety that one's job will be replaced by an LLM. That transforms into a feeling that LLMs are powerful, which he then transforms into money. That feeling was strongest back in 2023 and had weakened since then, but in this comment section it's back in full force.

For clarity, I don't question that many jobs will be replaced by LLMs. I just don't see a qualitative difference from all the jobs already replaced by computers, steam engines, horse-drawn plows, etc. A medieval peasant brought to the present would probably be just as despondent when he learned that almost all the farming jobs are gone; but we don't miss them.

1 comment

esafak 542 days ago

I think you did not watch the full video. The model performs at PhD level on maths questions, and expert level at coding.

tripletao 542 days ago

This submission is specifically about ARC-AGI-PUB, so that's what I was discussing.

I'm aware that LLMs can solve problems other than coloring grids, and I'd tend to agree those are likely to be more near-term useful. Those applications (coding, medicine, law, education, etc.) have been endlessly discussed, and I don't think I have much to add.

In my own work I've found some benefits, but nothing commensurate with the public mania. I understand that founders of AI-themed startups (a group that I see includes you) tend to feel much greater optimism. I've never seen any business founded without that optimism and I hope you succeed, not least because the entire global economy might now be depending on that. I do think others might feel differently for reasons other than simple ignorance, though.

In general, performance on benchmarks similar to tests administered to humans may be surprisingly unpredictive of performance on economically useful work. It's not intuitive at all to me that IBM could solve Jeopardy and then find no profitable applications of the technology; but that seems to be what happened.