
by llm_trw 592 days ago

These benchmarks are entirely pointless.

The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.

What matters is the data structures that underlie the problem space - graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.
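A minimal sketch of the three tasks named above, assuming an unweighted directed graph given as a list of edges. Breadth-first search covers the first two tasks at once, since the first path it finds also has the fewest edges. For the third task, the rule used here (transitivity) is only an illustrative assumption, and all function names are hypothetical rather than anything from the comment's author.

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of (src, dst) pairs into an adjacency dict."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    return adj


def shortest_path(edges, start, goal):
    """Breadth-first search: return a path with the fewest edges, or None."""
    adj = build_adjacency(edges)
    if start not in adj or goal not in adj:
        return None
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def derive_implicit_edges(edges):
    """Apply one assumed rule (a->b and b->c imply a->c) until nothing new appears."""
    known = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(known):
            for c, d in list(known):
                if b == c and (a, d) not in known:
                    known.add((a, d))
                    changed = True
    return known - set(edges)


edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]
print(shortest_path(edges, "A", "D"))        # ['A', 'E', 'D']
print(sorted(derive_implicit_edges(edges)))  # [('A', 'C'), ('A', 'D'), ('B', 'D')]
```

The rule-application loop is deliberately naive (quadratic per pass); it is meant to show the shape of the task, not an efficient closure algorithm.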

Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.

4 comments

nopinsight 592 days ago

> Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph.

This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?

llm_trw 592 days ago

Yes, they also fail. I've found the original GPT-4 to be the most consistent. One of these days I'll spend the couple of thousand needed to benchmark all the top models and see how they actually perform on a task that can't be gamed.

nopinsight 592 days ago

What kinds of problems in what domains did you test o1 models with?

I found that they are good at logic and math problems but still hallucinate. I didn’t try to stress test them with hard problems though.

llm_trw 592 days ago

Finding a path between two vertices when given an itinerary of all the edges in a general graph, exactly what I said in the OP.
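One hypothetical way such a test could be set up and scored, sketched under stated assumptions: `make_instance` plants a known path inside a random directed edge list so that a solution always exists, and `is_valid_path` checks a proposed answer mechanically. Both names, the parameters, and the planted-path construction are illustrative assumptions and not the commenter's actual harness. Prompting a model and parsing its reply are left out.

```python
import random


def make_instance(num_nodes, num_edges, path_length, seed=0):
    """Build a shuffled directed edge list that contains at least one
    path of `path_length` edges from node 0 to node `path_length`."""
    if path_length >= num_nodes:
        raise ValueError("path_length must be smaller than num_nodes")
    if not path_length <= num_edges <= num_nodes * (num_nodes - 1):
        raise ValueError("num_edges is out of range for this graph size")
    rng = random.Random(seed)
    # Planted path 0 -> 1 -> ... -> path_length guarantees a solution.
    edges = {(i, i + 1) for i in range(path_length)}
    # Pad with random distractor edges (these may create shortcuts).
    while len(edges) < num_edges:
        a, b = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if a != b:
            edges.add((a, b))
    edge_list = sorted(edges)
    rng.shuffle(edge_list)  # hide the planted path's ordering
    return edge_list, 0, path_length


def is_valid_path(edge_list, start, goal, answer):
    """True if `answer` starts at `start`, ends at `goal`,
    and every consecutive pair is an edge in the list."""
    edge_set = set(edge_list)
    if not answer or answer[0] != start or answer[-1] != goal:
        return False
    return all((a, b) in edge_set for a, b in zip(answer, answer[1:]))


edge_list, start, goal = make_instance(num_nodes=10, num_edges=20, path_length=5)
print(is_valid_path(edge_list, start, goal, [0, 1, 2, 3, 4, 5]))  # True
```

Because the answer is checked against the edge list rather than compared to a reference string, any valid route counts, including shortcuts created by the random distractor edges.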

mkl 592 days ago

Did you try asking them to write a program to do it?

andrepd 592 days ago

GP is trying to test the ability of LLMs to perform mathematical tasks, not their ability to store geeks4geeks pages.

youoy 592 days ago

Not to mention that math proofs are more than graph traversals (although maybe simple math problems are not). There is the problem of extracting the semantics of math formalisms. This is easier in day-to-day language; I don't know to what extent LLMs can also extract the semantics and relations of different mathematical abstractions.

benchmarkist 592 days ago

It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.

mkl 592 days ago

Most humans can't solve these problems, so it's certainly possible to imagine a legitimate AGI that can't either.

aurareturn 592 days ago

But humans can solve these problems given enough time and domain knowledge. LLMs would never be able to solve them unless they get smarter. That's the point.

It’s not about whether a random human can solve them. It’s whether AI, in general, can. Humans, in general, have proven to be able to solve them already.

mkl 592 days ago

I'm responding to this:

> It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.

I think it is possible to achieve AGI without creating an AGI that is an expert mathematician, and that it is possible to create a system that can do FrontierMath without achieving AGI. I.e. I think failure or success at FrontierMath is orthogonal to achieving AGI (though success at it may be a step on the way). Some humans can do it, and some AGIs could do it, but people and AI systems can have human-level intelligence without being able to do it. OTOH I think it would be hard to claim you have ASI if it can't do FrontierMath.

aurareturn 592 days ago

I think people just see FrontierMath as a goal post that an AGI needs to hit. The term "artificial general intelligence" implies that it can solve any problem a human can. If it can't solve math problems that an expert human can, then it's not AGI by definition.

I think we have to keep in mind that humans have specialized. Some do law. Some do math. Some are experts at farming. Some are experts at dance history. It's not the average AI vs the average human. It's the best AI vs the best humans at one particular task.

The point with FrontierMath is that we can summon at least one human in the world who can solve each problem. No AI can in 2024.

mkl 592 days ago

Okay, sounds like different definitions.

If you have a single system that can solve any problem any human can, I'd call that ASI, as it's way smarter than any human. It's an extremely high bar, and before we reach it I think we'll have very intelligent systems that can do more than most humans, so it seems strange not to call those AGIs (they would meet the definition of AGI on Wikipedia [1]).

[1] https://en.wikipedia.org/wiki/Artificial_general_intelligenc...

llm_trw 592 days ago

It is very much an open question just what an llm can solve when allowed to generate an indefinite number of intermediate tokens and allowed to sample an arbitrary amount of text to ground itself.

There are currently no tools that let llms do this and no one is building the tools for answering open ended questions.

benchmarkist 592 days ago

That's correct. Thanks for clarifying for me, because I have gotten tired of the comparison to "99% of humans can't do this" as a counter-argument to AI hype criticism.

mewpmewp2 591 days ago

AGI should be able to do anything the best humans can do. ASI is when it does everything better than the best humans.

pnut 591 days ago

Those thresholds look the same to me, personally.

An AI that can be onboarded to a random white collar job, and be interchangeably integrated into organisations, surely is AGI for all practical purposes, without eliminating the value of 100% of human experts.

campers 592 days ago

If an AI achieved 100% in this benchmark it would indicate super-intelligence in the field of mathematics. But depending on what else it could do it may fall short on general intelligence across all domains.

dr_dshiv 592 days ago

> they’re merely regurgitating memorized information

Source?

llm_trw 592 days ago

If a model can't innately reason over 5 steps in a simple task but produces a flawless 500-step proof, you either have divine intervention or memorisation.

NitpickLawyer 592 days ago

AlphaGeometry has entered the chat.

Also, AIMOv2 is doing stage 2 of their math challenge, they are now at "national olympics" level of difficulty. They have a new set of questions. Last year's winner (27/50 points) got 2/50 on the new set. In the first 3 weeks of the competition the top score is 10/50 on the new set, mostly with Qwen2.5-math. Given that this is a purposefully made new set of problems, and according to the organizers "made to be AI hard", I'd say the regurgitation stuff is getting pretty stale.

Also also, the fact that claude3.5 can start coding in an invented language w/ ~20-30k tokens of "documentation" about the invented language is also some kind of proof that the stochastic parrots are the dismissers in this case.

llm_trw 592 days ago

I've not tested those models. Feel free to flick me through a couple of k in bitcoins if you'd like me to have a look for you.

firebaze 592 days ago

I'm not sure if it is feasible to provide all relevant sources to someone who doesn't follow a field. It is quite common knowledge that LLMs in their current form have no ability to recurse directly over a prompt, which inherently limits their reasoning ability.

dr_dshiv 591 days ago

I am not looking for all sources. And I do follow the field. I just don’t know the sources that would back the claim they are making. Nor do I understand why limits on recursion mean there is no reasoning and only memorization.

light_hue_1 592 days ago

This is just totally false.

That's exactly what countless techniques related to chain of thought do.

llm_trw 592 days ago

The closest explanation of how chain of thought works is that it suppresses the probability of a termination token.

People have found that even letting LLMs generate gibberish tokens produces better final outputs, which isn't a surprise when you realise that the only way an LLM can do computation is by outputting tokens.

dr_dshiv 592 days ago

It’s sometimes like, are these critics using the tools? It’s a strange schism at the moment.

llm_trw 592 days ago

It's my job to build these tools. I'm well aware of their strengths and shortcomings.

dr_dshiv 591 days ago

Unless you are building one of the frontier models, I’m not sure that your experience gives you insight on those models. Perhaps it just creates needless assumptions.

exe34 592 days ago

He just explained it to you.