| These benchmarks are entirely pointless. The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter writing a training regimen for a fighter jet. What matters is the data structure that underlies the problem space: a graph, and the ability to traverse it. That means, first, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges from a set of rules. Currently, all LLMs are so limited that they struggle with paths longer than four edges, even when given a full list of every edge in the graph. Until they can consistently manage more steps than any math proof in the validation data contains, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information. |
The four-edge limit probably does not hold for the o1 series, and possibly not for Claude 3.5 Sonnet. Have you tested them on this claim?
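For anyone who wants to check the claim rather than argue about it, here is a minimal sketch of a test harness in Python. Everything in it is an assumption for illustration, not an established benchmark: the function names (`make_task`, `format_prompt`, `score`), the prompt wording, and the "level" trick used to guarantee that the shortest path has exactly the requested number of edges. It only covers the first two tasks from the quoted comment (finding a path and finding the shortest one), not rule-based derivation of implicit edges, and it leaves the actual model call to you.

```python
import random
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over directed edges; returns a node list or None."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def is_valid_path(edges, path, start, goal):
    """True if path runs from start to goal using only listed edges."""
    edge_set = set(edges)
    return (bool(path) and path[0] == start and path[-1] == goal
            and all((a, b) in edge_set for a, b in zip(path, path[1:])))


def make_task(hops, extra_nodes=5, extra_edges=10, seed=0):
    """Build a directed graph whose shortest start->goal path has `hops` edges.

    Assumption of this sketch: every node gets a level, and an edge may raise
    the level by at most 1, so no distractor edge can create a shortcut.
    """
    rng = random.Random(seed)
    names = [f"N{i}" for i in range(hops + 1 + extra_nodes)]
    rng.shuffle(names)  # so node labels do not leak the path order
    level = {names[i]: i for i in range(hops + 1)}
    for name in names[hops + 1:]:
        level[name] = rng.randint(0, hops)
    edges = {(names[i], names[i + 1]) for i in range(hops)}
    attempts = 0
    while len(edges) < hops + extra_edges and attempts < 10_000:
        attempts += 1
        a, b = rng.sample(names, 2)
        if level[b] <= level[a] + 1:
            edges.add((a, b))
    edge_list = sorted(edges)
    rng.shuffle(edge_list)  # present edges in no helpful order
    return edge_list, names[0], names[hops]


def format_prompt(edges, start, goal):
    """Render the full edge list plus the question as plain text."""
    lines = "\n".join(f"{a} -> {b}" for a, b in edges)
    return (f"Directed edges:\n{lines}\n"
            f"Give the shortest path from {start} to {goal} "
            f"as a comma-separated list of nodes.")


def score(edges, start, goal, answer):
    """Return (valid, optimal) for a comma-separated answer string."""
    path = [tok.strip() for tok in answer.split(",") if tok.strip()]
    valid = is_valid_path(edges, path, start, goal)
    best = shortest_path(edges, start, goal)
    optimal = valid and best is not None and len(path) == len(best)
    return valid, optimal
```

To use it, call `make_task(hops=k)` for increasing `k`, send `format_prompt(...)` to whichever model you want to test, and pass the reply to `score(...)`. Since the graphs are freshly generated, a model cannot have memorized them, so the hop count at which `valid` or `optimal` starts failing is a direct check on the "four edges" claim.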