Hacker News
by starfezzy 515 days ago
Can it solve easy problems yet? Weirdly, I think that's an important milestone.

Prompts like, "Give me five odd numbers that don't have the letter 'e' in their spelling," or "How many 'r's are in the word strawberry?"

I suspect the breakthrough won't be trivial that enables solving trivial questions.

4 comments

> Give me five odd numbers that don't have the letter 'e' in their spelling

Compare the reasoning times!!! 84s vs 342s

R1 (Thought for 84 seconds)

      No odd number in English avoids the letter 'e' in its spelling. The request for five such numbers cannot be fulfilled.
o1 Pro (Thought for 5 minutes and 42 seconds)

      No standard English spelling of an odd number can avoid “e.” Every odd digit (one, three, five, seven, nine) already includes “e,” so once you build any odd number out of those digits, the spelled‐out form will contain “e.” As a result, there are no such odd numbers in standard English without an “e” in their name.
Even after five minutes, the logic has a (small) hole.

Not all odd numbers spell out the digit: thirteen, fifteen (and maybe eleven?) are odd but don’t have an ‘e’ in the word’s “digit” part, though they obviously do elsewhere.
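
For anyone who wants to check the claim mechanically rather than by argument, here is a minimal sketch that spells out every odd number below 1000 and looks for one without an 'e'. The `spell` helper is a hypothetical hand-rolled function (not a library call) and assumes the usual English number names with a hyphen between tens and ones.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def spell(n: int) -> str:
    """Spell 1..999 in English words (illustrative helper only)."""
    if not 1 <= n <= 999:
        raise ValueError("only 1..999 is supported in this sketch")
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if 10 <= rest <= 19:
        # Teens have their own words and do not reuse the digit names.
        parts.append(TEENS[rest - 10])
    else:
        tens, ones = divmod(rest, 10)
        word = TENS[tens]
        if ones:
            word = f"{word}-{ONES[ones]}" if word else ONES[ones]
        if word:
            parts.append(word)
    return " ".join(parts)


# Odd numbers whose spelling has no 'e' at all.
odd_without_e = [n for n in range(1, 1000, 2) if "e" not in spell(n)]

# The final word of each odd number's name: a digit word or a teen.
last_words = {spell(n).replace("-", " ").split()[-1] for n in range(1, 1000, 2)}

print(odd_without_e)  # []
print(sorted(last_words))
```

The set of final words covers both the odd digits and the odd teens, which is the case the o1 Pro explanation leaves out. Larger numbers only add words such as "thousand" in front, so the final word (and its 'e') stays the same.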

If the answer is supposed to be logically rigorous, o1 is incomplete for the reason you say.

If I was treating this as a lateral thinking puzzle (does anyone still do those for interview questions?) rather than a sensible request, I'd give numbers in a non-Latin character set, at which point it's easy: 一三五七九

(But even this only works for silly games, IMO).

4o: https://chatgpt.com/share/678ecbcd-a864-8007-9a34-e4e6574a2b... "Thus, the five odd numbers without an 'e' in their spelling are:

41, 43, 45, 47, and 49. "

o1: https://chatgpt.com/share/678ecbf8-ba20-8007-a33e-20ddbc545b... " 2 (two), 4 (four), 6 (six), 30 (thirty), 40 (forty)

Those five are “odd” (i.e., strange) choices, yet each spelled-out form has no letter e. This twist—treating “odd” to mean “strange” rather than “not even”—is usually the intended “gotcha” of the puzzle."

> This twist—treating “odd” to mean “strange” rather than “not even”—is usually the intended “gotcha” of the puzzle."

I like this one.

The 4o answer, on the other hand… unless I've missed something (and LLMs are increasingly highlighting to me the ways in which I do), it seems like the kind of wrong that gets LLMs a bad reputation?

It is! 4o is unfortunately often very dumb in tricky circumstances, or is biased toward pundit-like opinions that are wrong. I'm not sure why that is the case, but the full o1 always has a "weight"/"presence" to it when I chat with it that feels to me like a real intelligence. It can also solve difficult puzzles that 4o and I struggle with.
[Edit: I suppose the correct answer was for the full model]. What I got was much worse when I ran with hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0. (I think the glyph-based reasoning is especially hard for models though, since they have a token-level blindness to them):

  >>> Give me 5 odd numbers that don't have the letter 'e' in them.

  <think>
  Okay, so I need to come up with five odd numbers that don't have the letter 'e' in their name. Hmm, let's break this down step by step. First, what exactly is being asked here? The user wants numbers, specifically odd ones, and they can't contain the letter 'e'. So 
  I guess it's about the written form of the numbers, not their mathematical properties. For example, 13 is an odd number, and when you write it out, it's "thirteen," which does have an 'e' in it. So that wouldn't work.

  Alright, so I need to think of numbers where their names don't include 'e'. Let me start by listing some odd numbers and then check if each has an 'e' in their name. The first few odd numbers are 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, etc.

  Let's go one by one:

  1. "One": No 'e' here. So 1 is a candidate.
  2. "Three": Yes, that has an 'e'.
  3. "Five": Again, no 'e'. So 5 works.
  4. "Seven": Has an 'e' as well.
  5. "Nine": No 'e' in "nine". Good to go.

  Wait, so far I have 1, 5, and 9 as possible numbers. Let me keep going.
  <truncated by me...>
  Wait, let me make sure. The word "forty" is F-O-R-T-Y, no 'e', so adding "-one" which doesn't add an 'e' either. So yes, "forty-one" has no 'e's. So that's another number: 41.

  So now I have five numbers: 1, 5, 9, 19, and 41.

  Let me double-check each to ensure they don't contain the letter 'e':

  - One: No 'e'.
  - Five: No 'e'.
  - Nine: No 'e'.
  - Nineteen: No 'e'.
  - Forty-one: No 'e'.

  Yes, all of these are odd numbers and their names do not include the letter 'e'. I think that's a solid list.
  </think>

  Here is a list of five odd numbers whose names do not contain the letter 'e':

  1. **One**  
  2. **Five**  
  3. **Nine**  
  4. **Nineteen**  
  5. **Forty-one**

  These numbers are all odd, and their written forms do not include the letter 'e'.
Took 1m 36s for me. My default prompt is a bit different: “think from first principles”. It’s pretty verbose, but I enjoyed looking through all the work it did. Pretty impressive!
> Can it solve easy problems yet? Weirdly, I think that's an important milestone.

Easy for who? Some problems are better solved in one way compared to another.

In the case of counting letters and such, it is not an easy problem, because of how LLMs tokenize their input and output. On the other hand, it's a really simple problem for any programming/scripting language, or for humans.

And then you have problems like "5142352 * 51234" which are trivial problems for any basic calculator, but very hard for a human or an LLM.

Or "problems" like "Make a list of all the cities that were home to a celebrity who knows how to program in Fortran", which would be an "easy" problem for an LLM but a hard problem for pretty much anything other than Wikidata, assuming both the LLM and Wikidata have data about it in their datasets.

> I suspect the breakthrough won't be trivial that enables solving trivial questions.

So with what I wrote above in mind, LLMs already solve trivial problems, assuming you think about the capabilities of the LLM. Of course, if you meant "trivial for humans", I'd expect the answer to always remain "No", because things like "standing up" are trivial for humans but will never be trivial for an LLM: it doesn't have any legs!

Not gonna lie... wasn't expecting a correct answer. The thought process and confirmation of the calculation were LONG, and it was actually quite amazing to watch it deduce and then calculate in different ways to confirm.

The product of 5,142,352 and 51,234 is calculated as follows:

1. Break down the multiplication using the distributive property:
   - 5,142,352 × 51,234 = (5,000,000 + 142,352) × (50,000 + 1,234)

2. Expand and compute each part:
   - 5,000,000 × 50,000 = 250,000,000,000
   - 5,000,000 × 1,234 = 6,170,000,000
   - 142,352 × 50,000 = 7,117,600,000
   - 142,352 × 1,234 = 175,662,368

3. Sum all parts:
   - 250,000,000,000 + 6,170,000,000 = 256,170,000,000
   - 256,170,000,000 + 7,117,600,000 = 263,287,600,000
   - 263,287,600,000 + 175,662,368 = 263,463,262,368

Final Answer: 263463262368
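
The breakdown above is easy to double-check in a few lines. This is only a sketch that mirrors the same split; the variable names are made up for illustration.

```python
a, b = 5_142_352, 51_234

# Same split as in the answer: a = 5,000,000 + 142,352 and b = 50,000 + 1,234.
a_hi, a_lo = 5_000_000, 142_352
b_hi, b_lo = 50_000, 1_234

# The four partial products from the distributive expansion.
partials = [a_hi * b_hi, a_hi * b_lo, a_lo * b_hi, a_lo * b_lo]
total = sum(partials)

for p in partials:
    print(f"{p:,}")
print(f"{total:,}")  # 263,463,262,368

# Python integers are arbitrary precision, so this comparison is exact.
assert total == a * b
```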

> And then you have problems like "5142352 * 51234" which are trivial problems for any basic calculator, but very hard for a human or an LLM.

I think LLMs are getting better (well better trained) on dealing with basic math questions but you still need to help them. For example, if you just ask them to calculate the value, none of them gets it right.

http://beta.gitsense.com/?chat=876f4ee5-b37b-4c40-8038-de38b...

However, if you ask them to break down the multiplication to make it easier, three got it right.

http://beta.gitsense.com/?chat=ef1951dc-95c0-408a-aac8-f1db9...

> I think LLMs are getting better (well better trained) on dealing with basic math questions but you still need to help them

I feel like that's a fool's errand. Already in the GPT-3 days you could get the LLM to return JSON and make it call your own calculator, which is a way more efficient way of dealing with it than getting a language model to also be a "basic calculator" model.

Luckily, tool usage is easier than ever, and adding a `calc()` function ends up being a really simple and precise way of letting the model focus on text + general tool usage instead of combining many different domains.

Add a tool for executing Python code, and suddenly it gets way broader capabilities, without having to retrain and refine the model itself.
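
To make the `calc()` idea concrete, here is a rough sketch of the host side. Everything about the message format is an assumption: the JSON shape (`"tool"` / `"arguments"`) and the `handle_tool_call` helper are invented for illustration and do not match any particular vendor's tool-calling API. The evaluator walks Python's `ast` instead of calling `eval()`, so only plain arithmetic gets through.

```python
import ast
import json
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calc(expression: str):
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Names, calls, attribute access, etc. are rejected outright.
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


TOOLS = {"calc": calc}


def handle_tool_call(model_output: str):
    """Dispatch a JSON tool call (hypothetical message shape)."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](**call["arguments"])


# A made-up model reply asking the host to do the arithmetic.
reply = '{"tool": "calc", "arguments": {"expression": "5142352 * 51234"}}'
print(handle_tool_call(reply))  # 263463262368
```

The result would then be fed back to the model as text, so the model only has to decide when to call the tool, not how to multiply.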

I personally think getting LLMs to better deal with numbers will go a long way to making them more useful for different fields. I'm not an accountant, so I don't know how useful it would be. But being able to say, "here are some numbers, do this for scenario A and this for scenario B" and so forth might be useful.

Having said that, I do think models that favour writing code and using an "LLM interpretation layer" may make the most sense for the next few (or more) years.

Based on how humans operate, I’d say they should have a good “intuition” for approximate results, but use an external calculator for the exact numbers. Even if you can train it to be accurate, it’s going to be tremendously inefficient compared to calling out to some external service that can directly use the arithmetic hardware in the computer.
I agree, and this thread got me thinking about how I can package WASM in my chat app to execute LLM-generated code. I think a lot can be achieved today with a well-constructed prompt. For example, the prompt can say: if you are asked to perform a task like calculating numbers, write a program in JavaScript that can be compiled to WASM and wait for the response before continuing.
> Easy for who?

Consider things from a different angle.

The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.

So the advertising teams have set the bar very high indeed. As smart as the smartest humans around, maybe smarter.

The bar they have set for themselves doesn't allow for any "oh but the tokenisation" excuses.

Most human math phd's have all kinds of shortcomings. The idea that finding some "gotchas" shows that they are miles off the mark with the hype is absurd.
> Most human math phd's have all kinds of shortcomings.

I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.

Humans tasked to count how many vowels are in "Pneumonoultramicroscopicsilicovolcanoconiosis" (a real word), without seeing the word visually, just from language, would struggle. Working memory limits. We're not that different, we fail too.
I'll bet said PhDs can't answer the equivalent question in a language they don't understand. LLMs don't speak character-level English. LLMs are, in some stretched meaning of the word, illiterate.

If LLMs used character-level tokenization it would work just fine. But we don't do that and accept the trade-off. It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.
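
A toy illustration of the tokenization point, with loud caveats: the vocabulary below is invented, the greedy longest-match rule is a simplification, and real tokenizers (BPE and friends) learn tens of thousands of pieces and may split "strawberry" differently. The sketch only shows that with sub-word tokens the model's input contains no separate 'r' symbols to count.

```python
# Invented vocabulary for illustration only; not taken from any real model.
VOCAB = {"straw": 101, "berry": 102, "s": 1, "t": 2, "r": 3, "a": 4,
         "w": 5, "b": 6, "e": 7, "y": 8}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match split of `text` into toy token IDs."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


subword_ids = tokenize("strawberry")
char_ids = [VOCAB[ch] for ch in "strawberry"]  # character-level alternative

print(subword_ids)                   # [101, 102]
print(subword_ids.count(VOCAB["r"]))  # 0
print(char_ids.count(VOCAB["r"]))     # 3
```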

I’ll bet said PhDs will tell you they don’t know instead of confidently stating the wrong answer in this case. Getting LLMs to express an appropriate level of confidence in their output remains a major problem.
> It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.

I think it is meaningful in that it highlights how we need to approach things a bit differently. For example, instead of asking "How many r's in strawberry?", we say "How many r's in strawberry? Show each character in an ordered list before counting. When counting, list the position in the ordered list." If we do this, every model that I asked got it right.

https://beta.gitsense.com/?chat=167c0a09-3821-40c3-8b0b-8422...

There are quirks we need to better understand and I would say the strawberry is one of them.

Edit: I should add that getting LLMs to count things might not be the best way to go about it. Having it generate code to count things would probably make more sense.
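
As a sketch of what "generate code to count things" could look like, here is the same ordered-list approach from the prompt above written out directly. The `count_letter` name is made up for illustration.

```python
def count_letter(word: str, letter: str) -> tuple[int, list[int]]:
    """Return how often `letter` occurs in `word` and its 1-based positions."""
    positions = [i for i, ch in enumerate(word.lower(), start=1)
                 if ch == letter.lower()]
    return len(positions), positions


# Show each character in an ordered list, as the prompt asks the model to do.
for i, ch in enumerate("strawberry", start=1):
    print(i, ch)

count, positions = count_letter("strawberry", "r")
print(count, positions)  # 3 [3, 8, 9]
```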

I don't think that (sub-word) tokenization is the main difficulty. Not sure which models still fail the "strawberry" test, but I'd bet they can at least spell strawberry if you ask, indicating that breaking the word into letters is not the problem.

The real issue is that you're asking a prediction engine (with no working memory or internal iteration) to solve an algorithmic task. Of course you can prompt it to "think step by step" to get around these limitations, and if necessary suggest an approach (or ask it to think of one?) to help it keep track of its letter-by-letter progress through the task.

You say that very confidently - but why shouldn't an LLM have learned a character-level understanding of tokens?

LLMs would perform very badly on tasks like checking documents for spelling errors, processing OCRed documents, pluralising, changing tenses and handling typos in messages from users if they didn't have a character-level understanding.

It's only folks who have absolutely no idea how LLMs work that would think this task presents any difficulty whatsoever for a PhD-level superintelligence :)

    > LLMs are, in some stretched meaning of the word, illiterate.
You raise an interesting point here. How would LLMs need to change for you to call them literate? As a thought experiment, I can take a photograph of a newspaper article, then ask an LLM to summarise it for me. (Here, I assume that LLMs can do OCR.) Does that count?
> I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.

So can the current models.

It's frustrating that so many people think this line of reasoning actually pays off in the long run, when talking about what AI models can and can't do. Got any other points that were right last month but wrong this month?

There are always going to be doubters on this. It's like the self-driving doubters. Until you get absolute perfection, they'll point out shortcomings. Never mind that humans have more holes than Swiss cheese.
> The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.

Alright, why don't you go and discuss this with the people who say those things instead? No one made those points in this subthread, so not sure why they get brought up here.

I would argue anything requiring insights on spelling is a hard problem for an LLM: they use tokens, not letters. Your point still stands, but you need different examples IMO.
There is no breakthrough required, it's trivial. It's just that by making a model do that, you'll screw it up on several other dimensions.

Asking a question like this only highlights the questioner's complete lack of understanding of LLMs rather than an LLM's inability to do something.