| HN Mirror


by UebVar 373 days ago

> because they can only solve things that are already within their training set.

That is just plain wrong, as anybody who has spent more than 10 minutes with an LLM within the last 3 years can attest. Give it a try, especially if you care to have an opinion on them. Ask an absurd question (one that can, in principle, be answered) that nobody has asked before and see how well it generalizes. The hype is real.

I'm interested in which study you're referring to, because I'd like to see their methods and what they actually found out.

2 comments

jvanderbot 373 days ago

"The apple study" is being overblown too, but here it is: https://machinelearning.apple.com/research/illusion-of-think...

The crux is that beyond a bit of complexity the whole house of cards comes tumbling down. This is trivially obvious to any user of LLMs who has trained themselves to use LLMs (or LRMs in this case) to get better results ... the usual "But you're prompting it wrong" answer to any LLM skepticism. Well, that's definitely true! But it's also true that these aren't magical intelligent subservient omniscient creatures, because that would imply that they would learn how to work with you. And before you say "moving the goalposts", remember: this is essentially what the world thinks it is being sold.

It can be both breathless hysteria and an amazing piece of revolutionary and useful technology at the same time.

The training set argument is just a fundamental misunderstanding, yes, but you should think about the converse: can an LLM do well on things that are _inside_ its training set? This paper does use examples that are present all over the internet, including solutions. Things children can learn to do well. Figure 5 is a good figure to show the collapse in the face of complexity. We've all seen that when tearing through a codebase or trying to "remember" old information.


tough 373 days ago

I think Apple published that study right before WWDC to have an excuse not to offer foundation models bigger than 3B locally, and to force you to go through their cloud for harder reasoning tasks.

These are beta APIs, so things are still shifting, but those are my thoughts after playing with it. The paper makes much more sense in that context.


spion 373 days ago

What you think is an absurd question may not be as absurd as it seems, given the trillions of tokens of data on the internet, including its darkest corners.

In my experience, it's better to simply try using LLMs in areas where they don't have a lot of training data (e.g. reasoning about the behaviour of terraform plans). It's not a hard cutoff of being _only_ able to reason exactly about solved things, but it's not too far off as a first approximation.

The researchers took existing known problems and parameterised their difficulty [1]. While most of these are not by any means easy for humans, the interesting observation to me was that the problem size N at which the models failed did not track the complexity of the problem so much as how commonly solution "printouts" for that problem size can be encountered in the training data. For example, Towers of Hanoi, which has printouts of solutions for a variety of sizes, went to a very large number of steps N, while River Crossing, which is almost entirely absent from the training data for N larger than 3, failed above pretty much that exact number.

[1]: https://machinelearning.apple.com/research/illusion-of-think...
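To make "parameterised difficulty" concrete, here is a minimal sketch of my own (not the paper's code): Towers of Hanoi with N disks has an optimal solution of 2^N - 1 moves, so the length of a full solution "printout" grows exponentially with N. The function names and peg labels are assumptions chosen for illustration.

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the auxiliary peg.
    yield from hanoi_moves(n - 1, src, dst, aux)
    # Move the largest disk to its destination.
    yield (n, src, dst)
    # Move the n-1 smaller disks from the auxiliary peg onto the largest disk.
    yield from hanoi_moves(n - 1, aux, src, dst)


def solution_length(n):
    """Number of moves in the optimal solution for n disks."""
    return sum(1 for _ in hanoi_moves(n))


def is_legal(n, moves):
    """Replay a move list on a simulated board and check it solves the puzzle."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        # The disk being moved must be on top of the source peg.
        if not pegs[src] or pegs[src][-1] != disk:
            return False
        # A disk may never be placed on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


for n in range(1, 11):
    # The move count doubles (plus one) with every extra disk.
    print(n, solution_length(n))
```

A checker like `is_legal` is one way to score a model's answer move by move instead of comparing it to a memorised printout.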


CSSer 373 days ago

It doesn't help that thanks to RLHF, every time a good example of this gains popularity, e.g. "How many Rs are in 'strawberry'?", it's often snuffed out quickly. If I worked at a company with an LLM product, I'd build tooling to look for these kinds of examples in social media or directly in usage data so they can be prioritized for fixes. I don't know how to feel about this.

On the one hand, it's sort of like red teaming. On the other hand, it clearly gives consumers a false sense of ability.


spion 373 days ago

Indeed. Which is why I think the only way to really evaluate the progress of LLMs is to curate your own personal set of example failures that you don't share with anyone else, and to use it only via APIs that provide some sort of no-data-retention and no-training guarantees.
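A rough sketch of what such a private set could look like in practice. Everything here is an assumption for illustration: `ask` is a hypothetical callable you would wire to whichever API you trust, the case format is made up, and substring matching is a deliberately crude scoring rule.

```python
from typing import Callable


def run_private_evals(cases: list, ask: Callable[[str], str]) -> dict:
    """Run each private case through `ask` and report which ones still fail."""
    failed_ids = []
    for case in cases:
        answer = ask(case["prompt"])
        # Crude check: the expected string must appear somewhere in the answer.
        if case["expected"].lower() not in answer.lower():
            failed_ids.append(case["id"])
    return {
        "total": len(cases),
        "passed": len(cases) - len(failed_ids),
        "failed_ids": failed_ids,
    }


# Hypothetical cases; keep the real ones in a local file you never publish.
CASES = [
    {"id": "strawberry", "prompt": "How many Rs are in 'strawberry'?", "expected": "3"},
    {"id": "capital", "prompt": "What is the capital of France?", "expected": "Paris"},
]


def stub_ask(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    if "strawberry" in prompt:
        return "There are 2 Rs."
    return "The capital of France is Paris."


print(run_private_evals(CASES, stub_ask))
```

Re-running the same cases against each new model version gives you a trend line that can't have been patched by someone who saw your examples go viral.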
