Hacker News
by nopinsight 592 days ago
> Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph.

This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?


Yes, they also fail. I've found the original GPT-4 to be the most consistent. One of these days I'll spend the couple of thousand needed to benchmark all the top models and see how they actually perform on a task that can't be gamed.
What kinds of problems in what domains did you test o1 models with?

I found that they are good at logic and math problems but still hallucinate. I didn't try to stress-test them with hard problems, though.

Finding a path between two vertices when given an itinerary of all the edges in a general graph, exactly what I said in the OP.
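
For reference, the task described above is roughly what a breadth-first search does in a few lines. This is only a sketch: the function name `find_path`, the edge-list format, and the choice to treat edges as undirected are assumptions for illustration, not details taken from the OP's benchmark.

```python
from collections import deque


def find_path(edges, start, goal):
    """Return one shortest path from start to goal as a list, or None.

    `edges` is the full itinerary: an iterable of (u, v) pairs.
    Assumption: edges are undirected. Drop the second append for a digraph.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    # parent[x] records the vertex from which x was first reached.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


# A journey of five edges, one more than the four-edge limit quoted above.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("A", "G")]
print(find_path(edges, "A", "F"))
```

The point of the benchmark, as stated, is whether a model can carry out this traversal itself, not whether it can emit code like this.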
Did you try asking them to write a program to do it?
GP is trying to test the ability of LLMs to perform mathematical tasks, not their ability to store geeks4geeks pages.
Not sure why you're being downvoted; that is exactly why I'm using that simple problem to benchmark LLMs. If an LLM can't figure out how to traverse a graph in its working memory, then it has no hope of figuring out how to structure a proof.

Under natural deduction, all proofs are subtrees of the graph induced by the inference rules from the premises. Right now LLMs can't even do a linear proof once it gets too long, even when given all the induced vertices.
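
To make the "linear proof is a path" point concrete, here is a deliberately simplified sketch. It assumes the only inference rule is modus ponens over atomic propositions, with each implication given as an `(antecedent, consequent)` pair. Real natural deduction has more rules and branching proofs, so this covers only the linear case mentioned above. The name `linear_proof` and the data format are hypothetical.

```python
def linear_proof(premise, implications, goal):
    """Return a chain premise, ..., goal linked by modus ponens, or None.

    `implications` is a list of (antecedent, consequent) pairs, each read
    as "antecedent implies consequent". Assumption: atomic propositions
    and a single premise, so the proof graph is an ordinary digraph.
    """
    # derived_from[x] is the proposition that x was derived from.
    derived_from = {premise: None}
    frontier = [premise]
    while frontier:
        fact = frontier.pop()
        if fact == goal:
            break
        for antecedent, consequent in implications:
            if antecedent == fact and consequent not in derived_from:
                derived_from[consequent] = fact
                frontier.append(consequent)

    if goal not in derived_from:
        return None  # goal is not derivable from the premise

    # Read the derivation back from goal to premise, then reverse it.
    steps = []
    fact = goal
    while fact is not None:
        steps.append(fact)
        fact = derived_from[fact]
    return steps[::-1]


implications = [("P", "Q"), ("Q", "R"), ("R", "S"), ("P", "T")]
print(linear_proof("P", implications, "S"))
```

Under these assumptions, the proof search is the same reachability problem as the path-finding task, just with propositions as vertices and implications as directed edges.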