| HN Mirror

	by danielmarkbruce 518 days ago
	Most human math PhDs have all kinds of shortcomings. The idea that finding some "gotchas" shows that they are miles off the mark with the hype is absurd.

michaelt 518 days ago

> Most human math PhDs have all kinds of shortcomings.

I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.

visarga 517 days ago

Humans tasked with counting the vowels in "Pneumonoultramicroscopicsilicovolcanoconiosis" (a real word), without seeing the word visually, just from language, would struggle. Working memory has limits. We're not that different; we fail too.

danielmarkbruce 518 days ago

I'll bet said PhDs can't answer the equivalent question in a language they don't understand. LLMs don't speak character-level English. LLMs are, in some stretched meaning of the word, illiterate.

If LLMs used character-level tokenization it would work just fine. But we don't do that, and we accept the trade-off. It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.

wat10000 518 days ago

I’ll bet said PhDs will tell you they don’t know instead of confidently stating the wrong answer in this case. Getting LLMs to express an appropriate level of confidence in their output remains a major problem.

sdesol 518 days ago

> It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.

I think it is meaningful in that it highlights how we need to approach things a bit differently. For example, instead of asking "How many r's in strawberry?", we say "How many r's in strawberry? Show each character in an ordered list before counting. When counting, list the position in the ordered list." If we do this, every model that I asked got it right.

https://beta.gitsense.com/?chat=167c0a09-3821-40c3-8b0b-8422...

There are quirks we need to better understand and I would say the strawberry is one of them.

Edit: I should add that getting LLMs to count things might not be the best way to go about it. Having it generate code to count things would probably make more sense.
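The list-then-count procedure from the prompt above is also what a generated program would do. A minimal sketch, assuming a hypothetical helper name `count_with_positions` (not something from the linked chat):

```python
def count_with_positions(word: str, letter: str):
    # Step 1: the ordered list of (position, character), 1-based.
    ordered = list(enumerate(word, start=1))
    # Step 2: the positions in that list where the character matches.
    positions = [i for i, ch in ordered if ch == letter]
    return ordered, positions

ordered, positions = count_with_positions("strawberry", "r")
for i, ch in ordered:
    print(f"{i}. {ch}")
print(f"positions of 'r': {positions}, count: {len(positions)}")
# positions of 'r': [3, 8, 9], count: 3

# The built-in gives the same count without the intermediate list.
print("strawberry".count("r"))  # 3
```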

fzzzy 518 days ago

Yes, you should say: "Could you please write and execute a program to count the number of 'r' characters in the string 'strawberry'?"

HarHarVeryFunny 518 days ago

I was impressed with Claude Sonnet the other day - gave it a photo of my credit card bill (3 photos actually - long bill) and asked it to break it down by recurring categories, counting anything non-recurring as "other". It realized without being asked that a program was needed, and wrote/ran it to give me what I asked for.

sdesol 518 days ago

I think this will be the future. LLMs will know enough to know that they should hand things off to something else.

danielmarkbruce 518 days ago

It's the present. ChatGPT, for example, is an application. It uses models, but it does all kinds of stuff at the application level too.

danielmarkbruce 518 days ago

It's not that hard of a problem to solve at the application level. It's just hard to get a single model to do all the things.

sdesol 517 days ago

> It's not that hard of a problem to solve at the application level.

I think it will be easy if you are focused on one or two models from the same family, but I think the complexity comes when you try to get a lot of models to act in the same way.

HarHarVeryFunny 518 days ago

I don't think that (sub-word) tokenization is the main difficulty. Not sure which models still fail the "strawberry" test, but I'd bet they can at least spell strawberry if you ask, indicating that breaking the word into letters is not the problem.

The real issue is that you're asking a prediction engine (with no working memory or internal iteration) to solve an algorithmic task. Of course you can prompt it to "think step by step" to get around these limitations, and if necessary suggest an approach (or ask it to think of one?) to help it keep track of its letter-by-letter progress through the task.

danielmarkbruce 518 days ago

Breaking words/tokens is very explicitly the problem.

HarHarVeryFunny 517 days ago

No ... try claude.ai or meta.ai (both behave the same) by asking them how many r's in the (made up) word ferrybridge. They'll both get it wrong and say 2.

Now ask them to spell ferrybridge. They both get it right.

gemini.google.com still fails on "strawberry" (the other two seem to have trained on that, which is why I used a made-up word instead), but can correctly break it into a letter sequence if asked.

danielmarkbruce 517 days ago

Yep, if by chance you hit a model that has seen the training data that happens to shove those tokens together in a way that it can guess, lucky you.

The point is, it would be trivial for an LLM to get it right all the time with character-level tokenization. The reason LLMs using current tokenization (the best available tradeoff) find this activity difficult is that the tokens that make up "tree" don't include the token for "e".
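To make the character-level case concrete, here is a toy sketch. It assumes a hypothetical tokenizer, `char_tokenize`, that emits one id per character (the Unicode code point); no production model is claimed to work this way.

```python
def char_tokenize(text: str) -> list[int]:
    # Hypothetical character-level tokenizer: one id per character.
    return [ord(ch) for ch in text]

ids = char_tokenize("tree")
e_id = ord("e")

print(ids)              # [116, 114, 101, 101]
print(ids.count(e_id))  # 2
```

With this scheme the id for "e" appears in the input once per occurrence, so counting letters reduces to counting repeated ids.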

michaelt 518 days ago

You say that very confidently - but why shouldn't an LLM have learned a character-level understanding of tokens?

LLMs would perform very badly on tasks like checking documents for spelling errors, processing OCRed documents, pluralising, changing tenses and handling typos in messages from users if they didn't have a character-level understanding.

It's only folks who have absolutely no idea how LLMs work that would think this task presents any difficulty whatsoever for a PhD-level superintelligence :)

danielmarkbruce 518 days ago

LLMs are fed token ids, out of a tokenizer.... no characters. They don't even have any concept of a character.

You are in a discussion where you are just miles out of your depth. Go read LLMs 101 somewhere.

michaelt 517 days ago

If the LLM hasn't learned the letters that comprise input tokens, how do you explain this sort of behaviour?

https://chatgpt.com/share/678e95cf-5668-8011-b261-f96ce5a33a...

It can literally spell out words, one letter per line.

Seems pretty clear to me the training data contained sufficient information for the LLM to figure out which tokens correspond to which letters.

And it's no surprise the training data would contain such content - it'd be pretty easy to synthetically generate misspellings, and being able to deal with typos and OCR mistakes gracefully would be useful in many applications.

danielmarkbruce 517 days ago

Two answers: 1 - ChatGPT isn't an LLM, it's an application using one/many LLMs and other tools (likely routing that to a split function).

2 - even for a single model 'call':

It can be explained with the following training samples:

"tree is spelled t r e e" and "tree has 2 e's in it"

The problem is, the LLM has seen something like:

8062, 382, 136824, 260, 428, 319, 319

and

19816, 853, 220, 17, 319, 885, 306, 480

For a lot of words, it will have seen data that results in it saying something sensible. But it's fragile. If LLMs used character-level tokenization, you'd see the first example repeat the token for "e" in "tree" rather than "tree" having its own token.

There are all manner of tradeoffs made in a tokenization scheme. One example is that OpenAI made a change in space tokenization so that it would produce better Python code.
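The "tree" example can be reproduced with a toy sub-word tokenizer. This is only a sketch: the vocabulary and ids below are made up for illustration and do not match the ids quoted above or any real tokenizer, and greedy longest-match stands in for whatever merge rules a real tokenizer uses.

```python
# Hypothetical toy vocabulary; ids are invented for illustration.
VOCAB = {"tree": 1, " is": 2, " spelled": 3, " t": 4, " r": 5, " e": 6}

def tokenize(text: str, vocab: dict[str, int] = VOCAB) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    pieces = sorted(vocab, key=len, reverse=True)  # longest entries first
    ids = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token for text at position {i}")
    return ids

print(tokenize("tree is spelled t r e e"))  # [1, 2, 3, 4, 5, 6, 6]
print(tokenize("tree"))                     # [1]
```

The word "tree" arrives as the single id 1, which has no built-in relation to id 6 (" e"). Any link between the two has to be learned from training samples like the spelled-out sentence.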

HarHarVeryFunny 517 days ago

You're the one out of your depth ...

LLMs are taught to predict. Once they've seen enough training samples of words being spelled, they'll have learnt that in a spelling context the tokens comprising the word predict the tokens comprising the spelling.

Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).

Of course, you could just try it for yourself - ask an LLM to break a non-dictionary nonsense word like "asdpotyg" into a letter sequence.

famouswaffles 516 days ago

Have you seen the Byte Latent Transformer paper?

It does away with sub-word tokenization but is still more or less a transformer (no working memory or internal iteration). Mostly, the performance gains seem modest (and not unanimous: on some benchmarks it's a bit worse)... until you hit anything to do with character-level manipulation, and then it just stomps. 1.1% to 99% on CUTE - Spelling is a particularly egregious example.

I'm not sure what the problem is exactly, but clearly something about sub-word tokenization is giving these models a particularly hard time on these sorts of tasks.

https://arxiv.org/abs/2412.09871

danielmarkbruce 517 days ago

> Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).

They often fail at things like this, hence the strawberry example. Because they can't break down a token or have any concept of it. There is a sort of sweet spot where it's really hard (like strawberry). The example you give above is so far from a real word that it gets tokenized into lots of tokens, i.e. it's almost character-level tokenization. You also have the fact that none of the mainstream chat apps are blindly shoving things into a model. They are almost certainly routing that to a split function.
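What "routing that to a split function" might look like at the application level can be sketched as follows. Everything here is hypothetical: the regex, the `answer` router, and the `call_model` stub are illustrations, not a description of how any actual chat app is built.

```python
import re

# Hypothetical pattern for questions like: How many r's in "strawberry"?
LETTER_COUNT = re.compile(
    r"how many (?P<letter>[a-z])'?s? (?:are )?in (?:the word )?"
    r'"?(?P<word>[a-z]+)"?',
    re.IGNORECASE,
)

def call_model(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "(model response)"

def answer(prompt: str) -> str:
    match = LETTER_COUNT.search(prompt)
    if match:
        # Handle letter counting in ordinary code; the model never sees it.
        letter = match.group("letter").lower()
        word = match.group("word").lower()
        return str(word.count(letter))
    return call_model(prompt)

print(answer("How many r's in strawberry?"))                   # 3
print(answer('How many r\'s are in the word "ferrybridge"?'))  # 3
print(answer("Summarise this article."))                       # (model response)
```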

fzzzy 518 days ago

The LLM has absolutely no way of knowing which characters are in which token.

throwaway2037 517 days ago

> LLMs are, in some stretched meaning of the word, illiterate.

You raise an interesting point here. How would LLMs need to change for you to call them literate? As a thought experiment, I can take a photograph of a newspaper article, then ask an LLM to summarise it for me. (Here, I assume that LLMs can do OCR.) Does that count?

danielmarkbruce 517 days ago

It's a bit of a stretch to call them illiterate, but if you squint, it's right.

The change is easy - get rid of tokenization and feed in characters or bytes.

The problem is, that causes all kinds of other problems with respect to required model size, required training, and so on. It's a researchy thing, I doubt we end up there any time soon.

CamperBob2 518 days ago

So can the current models.

It's frustrating that so many people think this line of reasoning actually pays off in the long run, when talking about what AI models can and can't do. Got any other points that were right last month but wrong this month?

danielmarkbruce 518 days ago

There are always going to be doubters on this. It's like the self-driving doubters. Until you get absolute perfection, they'll point out shortcomings. Never mind that humans have more holes than Swiss cheese.
