|
That makes me wonder whether we could simply test this by having the LLM add or multiply a long list of numbers. Here is an experiment: https://www.gnod.com/search/#q=%23%20Calcuate%20the%20below%...
The correct answer: 20,192,642.460942328
Here is what I got from different models on the first try:
ChatGPT: 20,384,918.24
Perplexity: 20,000,000
Google: 25,167,098.4
Mistral: 200,000,000
Grok: Timed out after 300s of thinking
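
To make the comparison less eyeball-based, here is a rough sketch of how the answers above could be scored by relative error against the reference value. The helper names `parse_number` and `relative_error` are made up for illustration, and the numbers are just the ones quoted above; the timed-out run is left out because it produced no number.

```python
from decimal import Decimal


def parse_number(text):
    """Parse a number written with thousands separators, e.g. '20,000,000'."""
    return Decimal(text.replace(",", ""))


def relative_error(answer, reference):
    """Absolute difference from the reference, as a fraction of the reference."""
    return abs(answer - reference) / abs(reference)


# Reference value and first-try answers as quoted in this comment.
reference = parse_number("20,192,642.460942328")
answers = {
    "ChatGPT": "20,384,918.24",
    "Perplexity": "20,000,000",
    "Google": "25,167,098.4",
    "Mistral": "200,000,000",
}

# Print the models from closest to furthest from the reference.
for model, text in sorted(
    answers.items(),
    key=lambda item: relative_error(parse_number(item[1]), reference),
):
    err = relative_error(parse_number(text), reference)
    print(f"{model}: {float(err):.2%} off")
```

Relative error is only one possible metric. It treats a rounded guess and a near miss the same way, so it says nothing about whether a model actually did the arithmetic.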
|
You wouldn't ask a human to do that, so why would you ask an LLM to? I guess it's a way to test them, but it feels like the world record for backwards running: interesting, maybe, but not a good way to measure, like, anything about the individual involved.