Hacker News new | ask | show | jobs
by Jensson 832 days ago
I think he disagrees with 4:

4. Language prediction training will not get stuck in a local optimum.

Most previous things we train on could have been better served if the model developed AGI, but they didn't. There is no reason to expect LLMs to not get stuck in a local optimum as well, and I have seen no good argument as to why they wouldn't get stuck like everything else we tried.

2 comments

There is very little in terms of rigorous mathematics on the theoretical side of this. All we have are empirics, but everything we have seen so far points to the fact that more compute equals more capabilities. That's what they are referring to in the blog post. This is particularly true for the current generation of models, but if you look at the whole history of modern computing, the law roughly holds up over the last century. Following this trend, we can extrapolate that we will reach computers with raw compute power similar to the human brain for under $1000 within the next two decades.
More compute also requires more data - scaling equally with model size, according to the Chinchilla paper.

How much more data is available that hasn't already been swept up by AI companies?

And will that data continue to be available as laws change to protect copyright holders from AI companies?

It's not just the volume of original data that matters here. From empirics we know performance scales roughly like (model parameters)*(training data)*(epochs). If you increase any one of those, you can be certain to improve your model. In the short term, training data volume and quality has given a lot of improvements (especially recently), but in the long run it was always model size and total time spent training that saw improvements. In other words: It doesn't matter how you allocate your extra compute budget as long as you spend it.
In smaller models, not having enough training data for the model size leads to overfitting. The model predicts the training data better than ever, but generalizes poorly and performs worse on new inputs.

Is there any reason to think the same thing wouldn't happen in billion parameter LLMs?

This happens in smaller models because you reach parameter saturation very quickly. In modern LLMs and with current datasets, it is very hard to even reach this point, because the total compute time boils down to just a handful of epochs (sometimes even less than one). It would take tremendous resources and time to overtrain GPT4 in the same way you would overtrain convnets from the last decade.
True but also from general theory you should expect any function approximator to exhibit intelligence when exposed to enough data points from humans, the only question is the speed of convergence. In that sense we do have a guarantee that it will reach human ability
It's a bit more complicated than that. Your argument is essentially the universal approximation theorem applied to perceptrons with one hidden layer. Yes, such a model can approximate any algorithm to arbitrary precision (which by extension includes the human mind), but it is not computationally efficient. That's why people came up with things like convolution or the transformer. For these architectures it is much harder to say where the limits are, because the mathematical analysis of their basic properties is infinitely more complex.
LLMs aren't improving at things they're unable to do at all. An example being reasoning.
LLMs can reason. You can verify this empirically by asking questions which require reasoning to e.g. GPT-4.
This is not peer reviewed research and has some serious issues: https://news.ycombinator.com/item?id=37051450
It sounds like you're arguing against LLMs as AGI, which we're on the same page about.