HN Mirror


by aithrowaway1987 653 days ago

> Recent examples I've seen fall well within the range of innumeracy that people routinely display.

Here's GPT-4 Turbo in April botching a test almost all preschoolers could solve easily: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

I have not used LLMs since 2023, when GPT-4 routinely failed almost every counting problem I could think of. I am sure the performance has improved since then, though "write an essay with 250 words" still seems unsolved.

The real problem is that LLM providers have to play a stupid game of whack-a-mole where an enormous number of trivial variations on a counting problem need to be specifically taught to the system. If the system were capable of true quantitative reasoning, that wouldn't be necessary for basic problems.

There is also a deceptive claim that "chain of thought" prompting makes LLMs much better at counting. But that's cheating: if the LLM had quantitative reasoning, it wouldn't need a human to indicate which problems were amenable to step-by-step thinking. (And this only works for O(n) counting problems, like "count the number of words in the sentence." CoT prompting fails to solve O(nm) counting problems like "count the number of words in this sentence which contain the letter 'e'." For this you need a more specific prompt, like "First, go step-by-step and select the words which contain 'e.' Then go step-by-step to count the selected words." It is worth emphasizing over and over that rats are not nearly this stupid; they can combine tasks to solve complex problems without a human holding their hand.)
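For reference, the two-step task described in that prompt (select the words containing 'e', then count them) is trivial as ordinary code. This is a minimal sketch for illustration only: the function names are hypothetical, and the word-splitting rule (runs of letters and apostrophes) is an assumption, not something specified in the comment.

```python
import re


def words_containing(sentence: str, letter: str) -> list[str]:
    """Step 1: select the words that contain `letter` (case-insensitive).

    Assumption: a "word" is a run of ASCII letters or apostrophes.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    # Scanning every word for the letter is the O(nm) part:
    # n words, up to m characters each.
    return [w for w in words if letter.lower() in w.lower()]


def count_words_containing(sentence: str, letter: str) -> int:
    """Step 2: count the words selected in step 1."""
    return len(words_containing(sentence, letter))


sentence = "count the number of words in this sentence which contain the letter e"
print(words_containing(sentence, "e"))
print(count_words_containing(sentence, "e"))
```

The point of the sketch is only that the task composes two simple passes; it says nothing about how an LLM does or should solve it.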

I don't know what you mean by "10 years ago" other than a desire to make an ad hominem attack about me being "stuck." My point is that these "capabilities" don't include "understands what a number is in the same way that rats and toddlers understand what numbers are." I suspect that level of AI is decades away.

1 comment

famouswaffles 653 days ago

Your test does not make any sense whatsoever, because all GPT currently does when it creates an image is send a prompt to DALL-E 3.

Beyond that, LLMs don't see words or letters (tokens are neither), so some counting issues are expected.
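To illustrate the token point, here is a toy sketch. Everything in it is an assumption made for illustration: the vocabulary is invented, and greedy longest-match splitting is a simplification, not the algorithm any GPT model actually uses. It only shows that the model's input is a sequence of integer IDs rather than letters.

```python
# Hypothetical toy vocabulary. Real tokenizers learn tens of thousands
# of pieces from data; these entries are invented for the example.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(text: str, vocab: dict[str, int] = TOY_VOCAB) -> list[int]:
    """Split text into token IDs by greedy longest match (a simplification)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


print(tokenize("strawberry"))
```

With this toy vocabulary, "strawberry" becomes two IDs, and nothing in those two integers directly exposes how many times the letter 'r' appears.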

But it's not very surprising you've been giving tests that make no sense.
