Hacker News
by Buttons840 1158 days ago
> AI can generate as much synthetic data as we need, on demand.

I don't think this is right.

Can I take an untrained LLM (a neural network with random parameters), and have it start generating garbage, and then train the network to produce more of the same and then have it bootstrap itself to intelligence? Of course not.

What if I train it just a little bit first? What if I train it until it produces gibberish, but occasionally strings together two correctly spelled words? Can I have it produce petabytes of gibberish and then train on that to reach GPT-4's level?

You seem to argue that at some point, the AI is able to improve by training on its own output. At what point does that arrive? Because so far we've never seen an AI improve based on its own output. (As far as I know?)

3 comments

> Because so far we've never seen an AI improve based on its own output.

Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.

Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly it does improve the performance of the original model, up to a certain limit (far from the singularity).
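For anyone who hasn't seen pseudo-labeling in practice, here is a minimal self-training sketch. It is not taken from [1] or [2]; the nearest-centroid classifier, the 1-D features, the `threshold` value and every function name are assumptions made up for illustration. The idea is just that the model labels the unlabeled pool, keeps only the predictions it is confident about, and refits on them.

```python
def fit(labeled):
    """Nearest-centroid classifier on 1-D features: one mean per class."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict_with_margin(model, x):
    """Return (label, margin). The margin is how much closer x is to the
    nearest centroid than to the second nearest (assumes >= 2 classes)."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    (d0, y0), (d1, _) = dists[0], dists[1]
    return y0, d1 - d0

def self_train(labeled, unlabeled, threshold=1.0, rounds=5):
    """Pseudo-labeling loop: label the pool with the current model, keep
    only confident predictions, add them to the training set, refit."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = fit(labeled)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            y, margin = predict_with_margin(model, x)
            (confident if margin >= threshold else rest).append((x, y))
        if not confident:
            break  # nothing the model is sure about; stop early
        labeled.extend(confident)       # pseudo-labels are treated as real labels
        pool = [x for x, _ in rest]     # low-confidence points wait for a later round
        model = fit(labeled)
    return model

def accuracy(model, examples):
    """Fraction of (x, y) examples the model classifies correctly."""
    hits = sum(predict_with_margin(model, x)[0] == y for x, y in examples)
    return hits / len(examples)
```

Note where the gain comes from: the unlabeled pool is real data, and only the labels are generated by the model. That is also why the improvement tops out, since the model can't learn anything that isn't already in the distribution of the pool.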

Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].

Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, a specific type of boosting (gradient-boosted decision trees) often yields state-of-the-art performance on tabular data.
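To illustrate the "use the output of a model to train a slightly better one" reading of boosting, here is a rough AdaBoost sketch with decision stumps on 1-D data. It is a toy, not the gradient-boosting variant used on tabular data, and the function names and the stump representation are my own assumptions. Each round looks at which points the current ensemble member got wrong and upweights them for the next one.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: predicts `polarity` if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Search all thresholds and polarities for the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Labels must be +1 / -1. Returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Upweight the points this stump got wrong, downweight the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1
```

For example, with `xs = list(range(10))` and labels that are +1 only on 3..6, no single stump can fit the interval, but three boosting rounds classify all ten points correctly. Unlike the self-training case, the errors here are measured against true labels, so the model's own output is only used to decide where to focus next.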

[1] https://arxiv.org/abs/2101.06329

[2] https://stats.stackexchange.com/questions/364584/why-does-us...

[3] https://arxiv.org/abs/2304.08466

[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)

This really only works well in resource limited settings and/or semisupervised tasks.

I've tried augmentation for LLM domain adaptation, and the gains are very modest even in the best case. Even then, the augmented corpus is a tiny fraction of the underlying training corpus.

I believe OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic, it's reasonable to disregard other areas of 'AI'), and the answer appears to be no, or at least unproven and counterintuitive.

Boosting is reminiscent of the wisdom of the crowd effect.

AlphaZero in fact improves based on its own output, but I agree it is a special case and probably not generalizable.

It's RL though. Its output comes, in part, from interaction with an environment. It also has a well-defined objective (win games). GPT doesn't have a clear objective other than "do more of this".

My generative melody models have done it for over a year, but that's with human curation, so it's not self-improving. It's typically easier to curate than to generate, and that's especially true for music and images. It's much simpler to recognize a good melody than to compose a new one. The same applies to writing, but to a lesser extent.