| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by diggan 515 days ago

> Can it solve easy problems yet? Weirdly, I think that's an important milestone.

Easy for who? Some problems are better solved in one way compared to another.

In the case of counting letters and such, it is not a easy problem, because of how the LLM tokenizes their input/outputs. On the other hand, it's really simple problem for any programming/scripting language, or humans.

And then you have problems like "5142352 * 51234" which is trivial problems for any basic calculator, but very hard for a human or a LLM.

Or "problems" like "Make a list of all the cities that had celebrity from there who knows how to program in Fortan", would be a "easy" problem for a LLM, but pretty much a hard problem anything else than Wikidata, assuming both LLM/Wikidata have data about it in their datasets.

> I suspect the breakthrough won't be trivial that enables solving trivial questions.

So with what I wrote above in mind, LLMs already solve trivial problems, assuming you think about the capabilities of the LLM. Of course, if you meant "trivial for humans", I'll expect the answer to always remain "No", because things like "Standing up" is trivial for humans, but it'll never be trivial for a LLM, it doesn't have any legs!

3 comments

cchance 515 days ago

Not gonna lie ... wasnt expecting a correct answer... The thought process and confirmation of the calculation were LONG and actually quite amazing to watch it deduce and then calculate in different ways to confirm

The product of 5,142,352 and 51,234 is calculated as follows:

1. Break down the multiplication using the distributive property: - (5,142,352 times 51,234 = (5,000,000 + 142,352) times (50,000 + 1,234))

2. Expand and compute each part: - (5,000,000 times 50,000 = 250,000,000,000) - (5,000,000 times 1,234 = 6,170,000,000) - (142,352 times 50,000 = 7,117,600,000) - (142,352 times 1,234 = 175,662,368)

3. Sum all parts: - (250,000,000,000 + 6,170,000,000 = 256,170,000,000) - (256,170,000,000 + 7,117,600,000 = 263,287,600,000) - (263,287,600,000 + 175,662,368 = 263,463,262,368)

Final Answer: 263463262368

sdesol 515 days ago

> And then you have problems like "5142352 * 51234" which is trivial problems for any basic calculator, but very hard for a human or a LLM.

I think LLMs are getting better (well better trained) on dealing with basic math questions but you still need to help them. For example, if you just ask it them to calculate the value, none of them gets it right.

http://beta.gitsense.com/?chat=876f4ee5-b37b-4c40-8038-de38b...

However, if you ask them to break down the multiplication to make it easier, three got it right.

http://beta.gitsense.com/?chat=ef1951dc-95c0-408a-aac8-f1db9...

diggan 515 days ago

> I think LLMs are getting better (well better trained) on dealing with basic math questions but you still need to help them

I feel like that's a fools errand. You could already in GPT3 days get the LLM to return JSON and make it call your own calculator, way more efficient way of dealing with it, than to get a language model to also be a "basic calculator" model.

Luckily, tools usage is easier than ever, and adding a `calc()` function ends up being really simple and precise way of letting the model focus on text+general tool usage instead of combining many different domains.

Add a tool for executing Python code, and suddenly it gets way broader capabilities, without having to retrain and refine the model itself.

sdesol 515 days ago

I personally think getting LLMs to better deal with numbers will go a long way to making them more useful for different fields. I'm not an accountant, so I don't know how useful it would be. But being able to say, here are some numbers do this for scenario A and this for scenario B and so forth might be useful.

Having said that, I do think models that favours writing code and using a "LLM interpretation layer" may make the most sense for the next few (or more) years.

wat10000 515 days ago

Based on how humans operate, I’d say they should have a good “intuition” for approximate results, but use an external calculator for the exact numbers. Even if you can train it to be accurate, it’s going to be tremendously inefficient compared to calling out to some external service that can directly use the arithmetic hardware in the computer.

sdesol 515 days ago

I agree and this thread got me thinking about how I can package WASM in my chat app to execute LLM generated code. I think a lot can be achieve today with a well constructed prompt. For example, the prompt can say, if you are asked to perform a task like calculating numbers, write a program in JavaScript that can be compiled to WASM and wait for the response before continuing.

wat10000 515 days ago

External tool use and general real-world integration seems to be really lacking currently. Maybe current models are still too limited, but it seems like they should be able to do much better if they weren’t effectively running in a little jar.

Philpax 515 days ago

Don't really need WASM for that - have you tried Claude Artifacts?

diggan 515 days ago

If only we had a function in JavaScript that could execute JavaScript code directly, wouldn't need WASM then (assuming it's just you + assistant locally).

michaelt 515 days ago

> Easy for who?

Consider things from a different angle.

The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.

So the advertising teams have set the bar very high indeed. As smart as the smartest humans around, maybe smarter.

The bar they have set for themselves doesn't allow for any "oh but the tokenisation" excuses.

danielmarkbruce 515 days ago

Most human math phd's have all kinds of shortcomings. The idea that finding some "gotchas" shows that they are miles off the mark with the hype is absurd.

michaelt 515 days ago

> Most human math phd's have all kinds of shortcomings.

I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.

visarga 514 days ago

Humans tasked to count how many vowels are in "Pneumonoultramicroscopicsilicovolcanoconiosis" (a real word), without seeing the word visually, just from language, would struggle. Working memory limits. We're not that different, we fail too.

danielmarkbruce 515 days ago

I'll bet said phds can't answer the equivalent question in a language they don't understand. LLMs don't speak character level english. LLMs are, in some stretched meaning of the word, illiterate.

If LLMs used character level tokenization it would work just fine. But we don't do that and accept the trade off. It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.

wat10000 515 days ago

I’ll bet said PhDs will tell you they don’t know instead of confidently stating the wrong answer in this case. Getting LLMs to express an appropriate level of confidence in their output remains a major problem.

sdesol 515 days ago

> It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.

I think it is meaningful in that it highlights how we need to approach things a bit differently. For example, instead of asking "How many r's in strawberry?", we say "How many r's in strawberry? Show each character in an ordered list before counting. When counting, list the position in the ordered list." If we do this, every model that I asked got it right.

https://beta.gitsense.com/?chat=167c0a09-3821-40c3-8b0b-8422...

There are quirks we need to better understand and I would say the strawberry is one of them.

Edit: I should add that getting LLMs to count things might not be the best way to go about it. Having it generate code to count things would probably make more sense.

fzzzy 515 days ago

Yes, you should say "could you please write and execute a program to count the number of "r" characters in the string "strawberry"

HarHarVeryFunny 515 days ago

I was impressed with Claude Sonnet the other day - gave it a photo of my credit card bill (3 photos actually - long bill) and asked it to break it down by recurring categories, counting anything non-recurring as "other". It realized without being asked that a program was needed, and wrote/ran it to give me what I asked for.

danielmarkbruce 515 days ago

It's not that hard of a problem to solve at the application level. It's just hard to get a single model to do all the things.

HarHarVeryFunny 515 days ago

I don't think that (sub-word) tokenization is the main difficulty. Not sure which models still fail the "strawberry" test, but I'd bet they can at least spell strawberry if you ask, indicating that breaking the word into letters is not the problem.

The real issue is that you're asking a prediction engine (with no working memory or internal iteration) to solve an algorithmic task. Of course you can prompt it to "think step by step" to get around these limitations, and if necessary suggest an approach (or ask it to think of one?) to help it keep track of it's letter by letter progress through the task.

danielmarkbruce 515 days ago

Breaking words/tokens is very explicitly the problem.

michaelt 515 days ago

You say that very confidently - but why shouldn't an LLM have learned a character-level understanding of tokens?

LLMs would perform very badly on tasks like checking documents for spelling errors, processing OCRed documents, pluralising, changing tenses and handling typos in messages from users if they didn't have a character-level understanding.

It's only folks who have absolutely no idea how LLMs work that would think this task presents any difficulty whatsoever for a PhD-level superintelligence :)

danielmarkbruce 515 days ago

LLMs are fed token ids, out of a tokenizer.... no characters. They don't even have any concept of a character.

You are in a discussion where you are just miles out of your depth. Go read LLMs 101 somewhere.

fzzzy 515 days ago

The llm has absolutely no way of knowing which characters are in which token.

throwaway2037 514 days ago

    > LLMs are, in some stretched meaning of the word, illiterate.

You raise an interesting point here. How would LLMs need to change for you to call them literate? As a thought experiment, I can take a photograph of a newspaper article, then ask a LLM to summarise it for me. (Here, I assume that LLMs can do OCR.) Does that count?

danielmarkbruce 514 days ago

It's a bit of a stretch to call them illiterate, but if you squint, it's right.

The change is easy - get rid of tokenization and feed in characters or bytes.

The problem is, that causes all kinds of other problems with respect to required model size, required training, and so on. It's a researchy thing, I doubt we end up there any time soon.

CamperBob2 515 days ago

I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.

So can the current models.

It's frustrating that so many people think this line of reasoning actually pays off in the long run, when talking about what AI models can and can't do. Got any other points that were right last month but wrong this month?

danielmarkbruce 515 days ago

There are always going to be doubters on this. It's like the self driving doubters. Until you get absolute perfection, they'll point out shortcomings. Never mind that humans have more holes than swiss cheese.

diggan 515 days ago

> The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.

Alright, why don't you go and discuss this with the people who say those things instead? No one made those points in this subthread, so not sure why they get brought up here.