Hacker News new | ask | show | jobs
by obblekk 538 days ago
Human performance is 85% [1]. o3 high gets 87.5%.

This means we have an algorithm to get to human level performance on this task.

If you think this task is an eval of general reasoning ability, we have an algorithm for that now.

There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.

Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!

[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1

10 comments

As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.

But, still, this is incredibly impressive.

Which parts of reasoning do you think is missing? I do feel like it covers a lot of 'reasoning' ground despite its on the surface simplicity
My personal 5 cents is that reasoning will be there when LLM gives you some kind of outcome and then when questioned about it can explain every bit of result it produced.

For example, if we asked an LLM to produce an image of a "human woman photorealistic" it produces result. After that you should be able to ask it "tell me about its background" and it should be able to explain "Since user didn't specify background in the query I randomly decided to draw her standing in front of a fantasy background of Amsterdam iconic houses. Usually Amsterdam houses are 3 stories tall, attached to each other and 10 meters wide. Amsterdam houses usually have cranes on the top floor, which help to bring goods to the top floor since doors are too narrow for any object wider than 1m. The woman stands in front of the houses approximately 25 meters in front of them. She is 1,59m tall, which gives us correct perspective. It is 11:16am of August 22nd which I used to calculate correct position of the sun and align all shadows according to projected lighting conditions. The color of her skin is set at RGB:xxxxxx randomly" etc.

And it is not too much to ask LLMs for it. LLMs have access to all the information above as they read all the internet. So there is definitely a description of Amsterdam architecture, what a human body looks like or how to correctly estimate time of day based on shadows (and vise versa). The only thing missing is logic that connects all this information and which is applied correctly to generate final image.

I like to think about LLMs as a fancy genius compressing engines. They took all the information in the internet, compressed it and are able to cleverly query this information for end user. It is a tremendously valuable thing, but if intelligence emerges out of it - not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.

I see two approaches for explaining the outcome: 1. Reasoning back on the result and justifying it. 2. Explainability - somehow justifying by looking at which neurons have been called. The first could lead to lying. E.g. think of a high schooler explaining copied homework. While the second one does indeed access the paths influencing the decision, but is a hard task due to the inherent way neural networks work.
> if we asked an LLM to produce an image of a "human woman photorealistic" it produces result

Large language models don't do that. You'd want an image model.

Or did you mean "multi-model AI system" rather than "LLM"?

It might be possible for a language model to paint a photorealistic picture though.
It is not.

You are confusing LLM:s with Generative AI.

Can an LLM use tools like humans do? Could it use an image model as a tool to query the image?
No, a LLM is a Large Language Model.

It can language.

I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.
That’s not inability to reason though, that’s having a social context.

Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.

What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.

Establishing context in psychology is hard.

o1 already fixed the red herrings...
LLMs are still bound to a prompting session. They can't form long term memories, can't ponder on it and can't develop experience. They have no cognitive architecture.

'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact there is a post by anthropic on this subject on the front page right now) and they are very hard to build.

Consequence of that for instance: it's not possible to have a LLM explore exhaustively a topic.

LLMs don’t, but who said AGI should come from LLMs alone. When I ask ChatGPT about something “we” worked on months ago, it “remembers” and can continue on the conversation with that history in mind.

I’d say, humans are also bound to promoting sessions in that way.

Last time I used ChatGPT 'memory' feature it got full very quickly. It remembered my name, my dog's name and a couple tobacco casing recipes he came up with. OpenAI doesn't seem to be using embeddings and a vector database, just text snippets it injects in every conversation. Because RAG is too brittle ? The same problem arises when composing LLM calls. Efficient and robust workflows are those whose prompts and/or DAG were obtained via optimization techniques. Hence DSPy.

Consider the following use case: keeping a swimming pool water clean. I can have a long running conversation with a LLM to guide me in getting it right. However I can't have a LLM handle the problem autonomously. I'd like to have it notify me on its own "hey, it's been 2 days, any improvement? Do you mind sharing a few pictures of the pool as well as the ph/chlorine test results ?". Nothing mind-boggingly complex. Nothing that couldn't be achieved using current LLMs. But still something I'd have to implement myself and which turns out to be more complex to achieve than expected. This is the kind of improvement I'd like to see big AI companies going after rather than research-grade ultra smart AIs.

Optimal phenomenological reasoning is going to be a tough nut to crack.

Luckily we don't know the problem exists, so in a cultural/phenomenological sense it is already cracked.

Current AI is good at text but not very good at 3d physical stuff like fixing your plumbing.
Does it include the use of tools to accomplish a task?

Does it include the invention of tools?

kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.

people with (high) intelligence talking and building (artificial) intelligence but never able to convincingly explain aspects of intelligence. just often talk ambiguously and circularly around it.

what are we humans getting ourselves into inventing skynet :wink.

its been an ongoing pet project to tackle reasoning, but i cant answer your question with regards to llms.

>> Kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.

Kinda interesting that mathematicians also can't do the same for mathematics.

And yet.

Mathematicians absolutely can, it's called foundations, and people actively study what mathematics can be expressed in different foundations. Most mathematicians don't care about it though for the same reason most programmers don't care about Haskell.
I don't care about Haskell either, but we know what reasoning is [1]. It's been studied extensively in mathematics, computer science, psychology, cognitive science and AI, and in philosophy going back literally thousands of years with grandpapa Aristotle and his syllogisms. Formal reasoning, informal reasoning, non-monotonic reasoning, etc etc. Not only do we know what reasoning is, we know how to do it with computers just fine, too [2]. That's basically the first 50 years of AI, that folks like His Nobelist Eminence Geoffrey Hinton will tell you was all a Bad Idea and a total failure.

Still somehow the question keeps coming up- "what is reasoning". I'll be honest and say that I imagine it's mainly folks who skipped CS 101 because they were busy tweaking their neural nets who go around the web like Diogenes with his lantern, howling "Reasoning! I'm looking for a definition of Reasoning! What is Reasoning!".

I have never heard the people at the top echelons of AI and Deep learning - LeCun, Schmidhuber, Bengio, Hinton, Ng, Hutter, etc etc- say things like that: "what's reasoning". The reason I suppose is that they know exactly what that is, because it was the one thing they could never do with their neural nets, that classical AI could do between sips of coffee at breakfast [3]. Those guys know exactly what their systems are missing and, to their credit, have never made no bones about that.

_________________

[1] e.g. see my profile for a quick summary.

[2] See all of Russeel & Norvig, as a for instance.

[3] Schmidhuber's doctoral thesis was an implementation of genetic algorithms in Prolog, even.

well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.

i doubt your mathmatician example is equivalent.

examples that are fresh on the mind that further my point. ive heard yann lecun baffled by llms instantiation/emergence of reasoning, along with other ai researchers. eric Schmidt thinks the agentic reasoning is the current frontier and people should be focusing on that. was listening to the start of an ai machine learning interview a week ago with some cs phd asked to explain reasoning and the best he could muster up is you know it when you see it…. not to mention the guy responding to the grandparent that gave a cop out answer ( all the most respect to him).

>> well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.

I'm going to bet you haven't encountered the right people then. Maybe your social circle is limited to folks like the person who presented a slide about A* to a dumb-struck roomfull of Deep Learning researchers, in the last NeurIps?

https://x.com/rao2z/status/1867000627274059949

Care to enlighten us with your explanation of what "reasoning" is?
I'd like to see this o3 thing play 5d chess with multiverse time travel or baba is you.

The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.

If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.

> climate change is one of the most difficult and worst problems of our time.

Slightly surprised to see this view here.

I can think of half a dozen more serious problems off hand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).

None of those problems really matter if we don't have a planet to live on
You've been greviously mislead if you think climate change could plausibly make the world uninhabbitable in the next couple of centuries given current trajectories. I advise going to the primary sources and emailing a climate scientist at your local university for some references.
> going to the primary sources and emailing a climate scientist at your local university for some references

I assume you've done this, otherwise you wouldn't be telling me to? Bold of you to assume my ignorance on this subject. You sound like you've fallen for corporate grifters who care more about short-term profit and gains over long-term sustainability (or you are one of said grifters, in which case why are you wasting your time on HN, shouldn't you be out there grinding?!)

Severe weather events are going to get more common and more devastating over the next couple of decades. They'll come for you and people you care about, just as they come for me and people I care about. It doesn't matter what you think you know about it.

Still it's comparing average human level performance with best AI performance. Examples of things o3 failed at are insanely easy for humans.
You'd be surprised what the AVERAGE human fails to do that you think is easy, my mom can't fucking send an email without downloading a virus, i have a coworker that believes beyond a shadow of a doubt the world is flat.

The Average human is a lot dumber than people on hackernews and reddit seem to realize, shit the people on mturk are likely smarter than the AVERAGE person

Not being able to send an email or believing the world is flat it’s not a sign of intelligence, I’d rather say it’s more about culture or being more or less scholarized. Your mom or coworker still can do stuff instinctively that is outperforming every algorithm out there and still unexplained how we do it. We still have no idea what intelligence is
Yet the average human can drive a car a lot better than ChatGPT can, which shows that the way you frame "intelligence" dictates your conclusion about who is "intelligent".
Pretty sure a waymo car drives better than an average SF driver.
And how well would a Waymo car do in this challenge with the ARC-AGI datasets?
Waymo cannot handle poor weather at all, average human can.

Being able to perform better than humans in specific constrained problem space is how every automation system has been developed.

While self driving systems are impressive, they don’t drive with anywhere close to skills of the average driver

Waymo blog with video of them driving in poor weather https://waymo.com/blog/2019/08/waymo-and-weather
There's a reason why Waymo isn't offered in Buffalo.
Is that reason because Buffalo is the 81st most populated city in the United States, or 123rd by population density, and Waymo currently only serves approximately 3 cities in North America?

We already let computers control cars because they're better than humans at it when the weather is inclement. It's called ABS.

If you take an electrical sensory input signal sequence, and transform it to a electrical muscle output signal sequence you've got a brain. ChatGPT isn't going to drive a car because it's trained on verbal tokens, and it's not optimized for the type of latency you need for physical interaction.

And the brain doesn't use the same network to do verbal reasoning as real time coordination either.

But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.

Maybe, but no doubt these "dumb" people can still get dressed in the morning, navigate a trip to the mall, do the dishes, etc, etc.

It's always been the case that the things that are easiest for humans are hardest for computers, and vice versa. Humans are good at general intelligence - tackling semi-novel problems all day long, while computers are good at narrow problems they can be trained on such as chess or math.

The majority of the benchmarks currently used to evaluate these AI models are narrow skills that the models have been trained to handle well. What'll be much more useful will be when they are capable of the generality of "dumb" tasks that a human can do.

Your examples are just examples of lack of information. That's not a measure for intelligence.

As a contrary point, most people think they are smarter than they really are.

There are things Chimps do easily that humans fail at, and vice/versa of course.

There are blind spots, doesn't take away from 'general'.

We can't agree whether Portia spiders are intelligent or just have very advanced instincts. How will we ever agree about what human intelligence is, or how to separate it from cultural knowledge? If that even makes sense.
I guess my point is more, if we can't decide about Portia Spiders or Chimps, then how can we be so certain about AI. So offering up Portia and Chimps as counter examples.
The downvotes should tell you, this is a decided "hype" result. Don't poo poo it, that's not allowed on AI slop posts on HN.
Yeah, I didn't realize Chimp studies, or neuroscience were out of vogue. Even in tech, people form strong 'beliefs' around what they think is happening.
What’s interesting is it might be very close to human intelligence than some “alien” intelligence, because after all it is a LLM and trained on human made text, which kind of represents human intelligence.
In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.

In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.

It's possible humans reason better through text than not through text, so these models, having been trained on text, should be able to out-reason any person who's not currently sitting down to write.
I wonder how much of an effect amount of time to answer has on human performance.
Yeah, this is sort of meaningless without some idea of cost or consequences of a wrong answer. One of the nice things about working with a competent human is being able to tell them "all of our jobs are on the line" and knowing with certainty that they'll come to a good answer.
Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).
Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."
A fair comparison might be average human. The average human isn't a STEM grad. It seems STEM grad approximately equals an IQ of 130. https://www.accommodationforstudents.com/student-blog/the-su...

From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659

Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.

Why would an average human be more fair than a trained human? The model is trained.
It's not saturated. 85% is average human performance, not "best human" performance. There is still room for the model to go up to 100% on this eval.
Curious about how many tests were performed. Did it consistently manage to successfully solve many of these types of problems?
NNs are not algorithms.
An algorithm is “a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer”

How does a giant pile of linear algebra not meet that definition?

It's not made of "steps", it's an almost continuous function to its inputs. And a function is not an algorithm: it is not an object made of conditions, jumps, terminations, ... Obviously it has computation capabilities and is Turing-complete, but is the opposite of an algorithm.
If it wasn’t made of steps then Turing machines wouldn’t be able to execute them.

Further, this is probably running an algorithm on top of an NN. Some kind of tree search.

I get what you’re saying though. You’re trying to draw a distinction between statistical methods and symbolic methods. Someday we will have an algorithm which uses statistical methods that can match human performance on most cognitive tasks, and it won’t look or act like a brain. In some sense that’s disappointing. We can build supersonic jets without fully understanding how birds fly.

Let's see that Turing machines can approximate the execution of NN :) That's why there are issues related to numerical precision, but the contrary is also true indeed, NNs can discover and use similar techniques used by traditional algorithms. However: the two remain two different methods to do computations, and probably it's not just by chance that many things we can't do algorithmically, we can do with NNs, what I mean is that this is not just related to the fact that NNs discover complex algorithms via gradient descent, but also that the computational model of NNs is more adapt to solving certain tasks. So the inference algorithm of NNs (doing multiplications and other batch transformations) is just needed for standard computers to approximate the NN computational model. You can do this analogically, and nobody would claim much (maybe?) it's running an algorithm. Or that brains themselves are algorithms.
Computers can execute precise computations, it's just not efficient (and it's very much slow).

NNs are exactly what "computers" are good for and we've been using since their inception: doing lots of computations quickly.

"Analog neural networks" (brains) work much differently from what are "neural networks" in computing, and we have no understanding of their operation to claim they are or aren't algorithmic. But computing NNs are simply implementations of an algorithm.

Edit: upon further rereading, it seems you equate "neural networks" with brain-like operation. But brain was an inspiration for NNs, they are not an "approximation" of it.

NN inference is an algorithm for computing an approximation of a function with a huge number of parameters. The NN itself is of course just a data structure. But there is nothing whatsoever about the NN process that is non-algorithmic.

It's the exact same thing as using a binary tree to discover the lowest number in some set of numbers, conceptually: you have a data structure that you evaluate using a particular algorithm. The combination of the algorithm and the construction of the data structure arrive at the desired outcome.

We don’t have evidence that a TM can simulate a brain. But we know for a fact that it can execute a NN.
> It's not made of "steps", it's an almost continuous function to its inputs.

Can you define "almost continuous function"? Or explain what you mean by this, and how it is used in the A.I. stuff?

Well, it's a bunch of steps, but they're smaller. /s
Each layer of the network is like a step, and each token prediction is a repeat of those layers with the previous output fed back into it. So you have steps and a memory.
I would say you are right that function is not an algorithm, but it is an implementation of an algorithm.

Is that your point?

If so, I've long learned to accept imprecise language as long as the message can be reasonably extracted from it.

> continuous

So, steps?

"Continuous" would imply infinitely small steps, and as such, would certainly be used as a differentiator (differential? ;) between larger discrete stepped approach.

In essence, infinite calculus provides a link between "steps" and continuous, but those are different things indeed.

Deterministic (ieee 754 floats), terminates on all inputs, correctness (produces loss < X on N training/test inputs)

At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.

How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because to my mind "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm, rather than or something like that.

But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.

NN is a very wide term applied in different contexts.

When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.

We also call that a NN (the joy of natural language).

Running inference on a model certainly is a algorithm.
I’ll believe it when the AI can earn money on its own. I obviously don’t mean someone paying a subscription to use the AI I mean, letting the AI lose on the Internet with only the goal of making money and putting it into a bank account.
You don't think there are already plenty of attempts out there?

When someone is "disinterested enough" to publish though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them one or two years, discard the many losers and publish the one or two top winners. I.E. first you should be suspicious of why it's being published, then of how selected that result is.

Do trading bots count?
No, the AI would have to start from zero and reason it's way to making itself money online, such as the humans who were first in their online field of interest (e-commerce, scams, ads etc from the 80's and 90's) when there was no guidance, only general human intelligence that could reason their way into money making opportunities and reason their way into making it work.
I don't think humans ever do that. They research/read and ask other humans.
Which AI already has stored in spades, even more so since people in the 80's 90's weren't working with the information available today. The AI is free to research and read all the information stored from other humans as well, just like the humans who reasoned their way into money making opportunities--only with vastly more information now, talk about an advantage. But is it intelligent enough do so without a human giving direct/step-by-step instructions; the way humans figure it out?
It actually beats the human average by a wide margin:

- 64.2% for humans vs. 82.8%+ for o3.

...

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374

Super human isn't beating rando mech turk.

Their post has stem grad at nearly 100%

This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).

So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.

In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.

This is so strange. people think that an llm trained on programming questions and docs can do mundane tasks like this means intelligent? Come on.

It really calls into question two things.

1. You don't know what you're talking about about.

2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.

Either way, not a good look.

This