| HN Mirror



	by hodgehog11 237 days ago
	I'm a bit confused by this; are you referring to vanishing/exploding gradients during training or iteration at inference? If the former, this is only true if you take too many steps. If the latter, we already know this works and scales well.


CaveTech 237 days ago

The latter, and I would disagree that “this works and scales well” in the general sense. It clearly has very finite bounds by the fact we haven’t achieved AGI by running an LLM in a loop.

The approach of “try a few more things before stopping” is a great strategy akin to taking a few more stabs at RNG. It’s not the same as saying keep trying until you get there - you won’t.


hodgehog11 237 days ago

> It clearly has very finite bounds by the fact we haven’t achieved AGI by running an LLM in a loop.

That's one hell of a criterion. Test-time inference follows a scaling law similar to pretraining's, and has resulted in dramatically improved performance on many complex tasks. The law of diminishing returns kicks in of course, but this doesn't mean it's ineffective.

> akin to taking a few more stabs at RNG

Assuming I understand you correctly, I disagree. Scaling laws cannot appear with glassy optimisation procedures (essentially iid trials until you succeed, the mental model you seem to be implying here). They only appear if the underlying optimisation is globally connected and roughly convex. It's no different than gradient descent in this regard.
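For reference, here is a minimal sketch of the "iid trials until you succeed" mental model, assuming a single task with a fixed per-attempt success probability `p` (both `p` and the function name `success_within` are illustrative, not from the thread). Under that assumption the chance of still failing after `n` attempts is `(1 - p) ** n`, which shrinks geometrically in `n`:

```python
def success_within(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds,
    assuming every attempt succeeds with the same probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-attempt success rate, chosen only for illustration.
    p = 0.1
    for n in (1, 2, 4, 8, 16, 32):
        print(f"n={n:>2}  P(success)={success_within(p, n):.4f}")
```

This only covers one task with a fixed `p`; it says nothing about how results aggregate across tasks of varying difficulty.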


razodactyl 235 days ago

But test-time inference leads to better data to train better models that can generate better test-time inference data.

There's an obvious trend going on here. Of course, we're still just growing these systems and going with whatever works.

It's worked well so far, even if it's more convoluted than elegant...

What puts my mind at ease is that the current state of these AI systems isn't going to go backwards, because the data they generate contributes to the pool of possible knowledge for more advanced systems.


CaveTech 237 days ago

I never made a claim that it's ineffective, just that it's of limited effectiveness. The diminishing returns kick in quickly, and there are more domains where it isn't applicable than where it is.


ddingus 237 days ago

Achieving AGI is not a requirement for working well.


malfist 236 days ago

How do you know beforehand whether you'll take too many steps?


hodgehog11 236 days ago

It's a hyperparameter much like learning rate. If the learning rate is too high, the training process would not work either. Addressing this is just a matter of a grid search.
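As a rough sketch of what that grid search could look like, assuming a toy objective `(x - 3) ** 2` minimised by plain gradient descent (the objective, the candidate values, and the names `final_loss` and `grid_search` are all made up for illustration). This toy only reproduces the "learning rate too high" failure; it just shows the mechanics of sweeping step count alongside learning rate:

```python
from itertools import product


def final_loss(lr: float, steps: int, x0: float = 0.0) -> float:
    """Run `steps` gradient-descent updates on the toy loss (x - 3)^2
    starting from x0, and return the loss at the final iterate."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)  # derivative of (x - 3)^2
        x -= lr * grad
    return (x - 3.0) ** 2


def grid_search(lrs, step_counts):
    """Evaluate every (lr, steps) pair and return the pair with the
    lowest final loss. Ties go to the first pair encountered."""
    return min(product(lrs, step_counts), key=lambda cfg: final_loss(*cfg))


if __name__ == "__main__":
    # Hypothetical candidate grids; 1.1 is deliberately too large and diverges.
    lrs = [0.01, 0.1, 0.5, 1.1]
    step_counts = [5, 20, 80]
    for lr, steps in product(lrs, step_counts):
        print(f"lr={lr:<5} steps={steps:<3} loss={final_loss(lr, steps):.3e}")
    print("best:", grid_search(lrs, step_counts))
```

In a real setting the score would come from a held-out validation metric rather than the training loss itself, which is what would let the search detect a step count that is too large.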
