Hacker News
by pmayrgundter 757 days ago
I'm curious about the framing of research like this: "The poor performance of transformers on arithmetic tasks" (relative to what?), and how that informs the adjacent conversation on progress towards AGI.

Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often depends partly on how competent one assumes humans are at the tasks in question. The optimists are, I think, more realistic about variance in human intelligence, while the pessimists seem to reserve the term "general intelligence" for a nearly perfect suite of capabilities that many otherwise intelligent people don't have in practice.

For example, with arithmetic, this study cites another, [Dziri et al. 2023], which says:

"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."

But this isn't the case.. 5-6% of the population have https://en.wikipedia.org/wiki/Dyscalculia, but can be otherwise normal.

I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.

DeepMind's position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this: AGI capabilities are defined along two axes, Performance and Generality (narrow vs. general), and the Performance levels are measured by comparison with the "Percentile of skilled adults" able to perform the task. https://arxiv.org/pdf/2311.02462#page=3.40

Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par with the 50th, 90th, or 99th percentile of skilled adults, respectively.
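
To make that mapping concrete, here is a minimal sketch in Python. It assumes only the three thresholds mentioned above (50th, 90th, 99th percentile of skilled adults); the function name, the label strings, and the "below Competent" bucket are my own illustration, not anything taken from the paper.

```python
from bisect import bisect_right

# Percentile thresholds as described in the comment above. The labels and
# the catch-all "below Competent" bucket are illustrative assumptions.
THRESHOLDS = [50.0, 90.0, 99.0]
LABELS = ["below Competent", "Competent", "Expert", "Virtuoso"]


def performance_level(percentile: float) -> str:
    """Map a 'percentile of skilled adults' score to a performance label."""
    if not 0.0 <= percentile <= 100.0:
        raise ValueError("percentile must be between 0 and 100")
    # bisect_right counts how many thresholds are <= percentile,
    # so a score exactly at a threshold lands in the higher bucket.
    return LABELS[bisect_right(THRESHOLDS, percentile)]


if __name__ == "__main__":
    for p in (30, 50, 92, 99.5):
        print(p, performance_level(p))
```

The hard part, of course, is not this lookup but measuring where a model actually sits relative to a sample of skilled adults on a given task.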

4 comments

Exactly, we need a much more granular approach to evaluating intelligence and generality. Our current conception of intelligence largely works because humans share evolutionary history and partake in the same 10+ years of standardized training. As such, many dimensions of our intelligence correlate quite a bit, and you can likely infer a person's "general" proficiency or education by checking only a subset of those dimensions. If someone can't do arithmetic then it's very unlikely that they'll be able to compute integrals.

LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.

Agreed on the first part, but as for LLMs not having correlated capabilities, I think we've seen that they do. As the GPTs progress, mainly by model size, their scores across a battery of tests go up, e.g. OpenAI's paper for GPT-4, which shows a leap in performance across a couple dozen tests.

Also found this, a Mensa test across the top dozen frontier models.

https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...

That does seem to me to be demonstrating a global type of reasoning or generalization.

Also see the author's note that, at least with Claude, they seem to be putting out a new release about every 20 IQ points.

AGI is like consciousness, 75% of the people in any given conversation are talking about different things.

Truthfully, we're going to see that improving language models towards AGI works out the same way self-driving cars did: we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.

At least with AGI, we can just throw up our hands, use an easier definition and take the W.

I don't understand the framing of your comment. You act like the LLM's feelings are going to be hurt if you say it isn't a real AGI. "Well, you can't do basic math expected of fifth graders, but there are dumb fifth graders too, so here's the 'human-level intelligence' participation trophy anyway."

> But this isn't the case.. 5-6% of the population have https://en.wikipedia.org/wiki/Dyscalculia, but can be otherwise normal.

This nitpicking is a red herring.

The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)

In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road for current systems to get there. Transformers are approximation machines, and they generalize well at that specific task. But that's also where it stops: they can't do things that aren't pattern approximation.

Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.

Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics (which, I will remind the reader, covers anything beyond basic programming work).

And critically, the fundamental architecture of these transformer systems doesn't allow combining them with other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer algebra system; you can only feed the 'finished' output of one system into another.