by cs702 1158 days ago
No.

AI can generate as much synthetic data as we need, on demand.

Many SOTA models, in fact, are already being trained with synthetic AI-generated data.

See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines

5 comments

> AI can generate as much synthetic data as we need, on demand.

I don't think this is right.

Can I take an untrained LLM (a neural network with random parameters), and have it start generating garbage, and then train the network to produce more of the same and then have it bootstrap itself to intelligence? Of course not.

What if I train it just a little bit first? What if I train it until it mostly produces gibberish but occasionally strings together two correctly spelled words? Can I have it produce petabytes of gibberish and then train on that to reach GPT-4's level?

You seem to argue that at some point, the AI is able to improve by training on its own output. At what point does that arrive? Because so far we've never seen an AI improve based on its own output. (As far as I know?)

> Because so far we've never seen an AI improve based on its own output.

Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.

Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly it does improve the performance of the original model, up to a certain limit (far from the singularity).
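To make the idea concrete, here is a minimal sketch of a pseudo-labeling loop. It is not the method from [1] or [2]: the 1-D nearest-centroid classifier, the margin-based confidence, and the `threshold` and `rounds` parameters are all assumptions picked to keep the example tiny.

```python
def fit(xs, ys):
    """Nearest-centroid 'model' on 1-D data: one mean per class (toy assumption)."""
    model = {}
    for c in set(ys):
        members = [x for x, y in zip(xs, ys) if y == c]
        model[c] = sum(members) / len(members)
    return model

def predict_with_margin(model, x):
    """Return (label, margin). Margin = gap between the two nearest centroid distances."""
    dists = sorted((abs(x - mu), c) for c, mu in model.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def pseudo_label(labeled_x, labeled_y, unlabeled_x, threshold=1.0, rounds=3):
    """Repeatedly adopt the model's own confident predictions as training labels."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    model = fit(xs, ys)
    for _ in range(rounds):
        still_unlabeled = []
        for x in pool:
            label, margin = predict_with_margin(model, x)
            if margin >= threshold:
                xs.append(x)          # confident: trust the model's own label
                ys.append(label)
            else:
                still_unlabeled.append(x)
        if len(still_unlabeled) == len(pool):
            break                     # nothing new was adopted, so stop
        pool = still_unlabeled
        model = fit(xs, ys)           # retrain on real + pseudo labels
    return model, len(xs) - len(labeled_x)

# Hypothetical toy data: two labeled points, seven unlabeled ones.
model, adopted = pseudo_label([-1.0, 1.5], [0, 1],
                              [-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, 0.2])
print(model, adopted)
```

The ambiguous point near the decision boundary is never adopted, which is the "when done properly" part: only confident predictions feed back into training, so the gain is bounded by what the unlabeled data can add.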

Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].

Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, a specific type of boosting often yields state of the art performance on tabular data.
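As a rough illustration of "using the output of a model to train a slightly better model", here is a small AdaBoost sketch with 1-D decision stumps. AdaBoost is just one boosting variant chosen for brevity (it is not necessarily the type meant above for tabular data), and the ten-point dataset is a hypothetical toy.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` if x <= threshold, else the opposite. Labels are +1/-1."""
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the (threshold, polarity) pair with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)      # vote strength of this stump
        ensemble.append((alpha, t, polarity))
        # The previous stump's mistakes get more weight in the next round.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

def training_errors(xs, ys, rounds):
    ensemble = adaboost(xs, ys, rounds)
    return sum(predict(ensemble, x) != y for x, y in zip(xs, ys))

# Hypothetical toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
print(training_errors(xs, ys, 1), training_errors(xs, ys, 3))
```

Note that each round still learns from the true labels; the earlier model's output only decides which examples get emphasized. That is a weaker form of "training on its own output" than generating new data.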

[1] https://arxiv.org/abs/2101.06329

[2] https://stats.stackexchange.com/questions/364584/why-does-us...

[3] https://arxiv.org/abs/2304.08466

[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)

This really only works well in resource-limited settings and/or semi-supervised tasks.

I've tried augmentation for LLM domain adaptation and it yields very modest gains in the best of situations, and even then the augmented corpus is a very tiny fraction of the underlying training corpus.

I believe OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic it's reasonable to disregard other areas of 'AI'), and the answer appears to be no, or at least unproven and non-intuitive.

Boosting is reminiscent of the wisdom of the crowd effect.

AlphaZero in fact improves based on its own output, but I agree it is a special case and probably not generalizable.

It's RL though. Its output comes, in part, from interaction with an environment. It also has a well-defined objective (win games). GPT doesn't have a clear objective other than "do more of this".

My generative melody models have done it for over a year but that's with human curation, so it's not self-improving. It's typically easier to curate than to generate, and it's especially true for music and images. It's much simpler to recognize a good melody than to compose a new one. The same applies to writing, but to a lesser extent.

You're just sampling from an already sampled distribution.

This is not the same thing. There will still be value for fine tuning, but it's no substitute.
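A toy sketch of the "sampling from an already sampled distribution" point, under strong assumptions: the "model" is only a unigram frequency table (nothing like an LLM), and `resample_generations` is a made-up helper name. Each generation is fit purely on the previous generation's samples.

```python
import random
from collections import Counter

def resample_generations(corpus, generations=5, seed=0):
    """Fit token frequencies, sample a same-sized corpus from them, repeat.

    Returns the number of distinct tokens present at each generation.
    """
    rng = random.Random(seed)
    current = list(corpus)
    support_sizes = [len(set(current))]
    for _ in range(generations):
        counts = Counter(current)                  # "training": estimate frequencies
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        # "generation": draw a new corpus from the fitted frequencies
        current = rng.choices(tokens, weights=weights, k=len(current))
        support_sizes.append(len(set(current)))
    return support_sizes

# Hypothetical corpus of 100 distinct tokens.
sizes = resample_generations(list(range(100)))
print(sizes)
```

In this setup the vocabulary can only stay the same or shrink from one generation to the next, because a token that was never sampled has zero probability afterwards. Nothing outside the original sample can ever appear.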

It's not a one-way consumer, just as humans are not. It can direct the long-term evolution of reason. For starters, it can be used to denoise/dedup/optimise the training set to bring it closer to the optimum (to create smaller "copies" of itself).

There are instances of things that happened (history, what Paris Hilton said on the 22nd of April, etc.; a big database of mostly irrelevant facts) and truths (math, physics, chemistry, etc.), where AI can enhance discoveries by helping us to see what we have not yet realised.

Both seem endless tbh, but personally I'm more interested in the latter.

To my knowledge, no SOTA model has been trained on a significant proportion of synthetic data. Has this changed?

The best examples I know of are instruction tuning sets but that is a minute amount of data compared to the unsupervised training data.

Lots of reasons this isn't universally true: it only works if you know enough about the data to simulate it, and you're stuck within some distribution + human guesses space that's not all-encompassing.

The easiest counterexample is training LLMs: how are you going to synthesize useful language examples if you want more? Some version of this is true for most applications.

Yeah the issue is you can generate data, but it won’t be good data. Training over random strings won’t make you learn language, but it’s technically data.

> AI can generate as much synthetic data as we need, on demand.

Doesn't work in the majority of domains. You need to know the generating process (e.g. game rules) and build a realistic simulation environment that emulates that, in order to generate data that is useful. Both of these things are out of reach for most applications.
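For contrast, here is a sketch of the easy case where the generating process is fully known. The game (tic-tac-toe), the random-move players, and the `(position, outcome)` record format are all illustrative assumptions. The point is only that the rules themselves act as the simulator and the labeler.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def simulate_game(rng):
    """Play one game with uniformly random legal moves; return (positions, outcome)."""
    board = [0] * 9
    player = 1
    positions = []
    while winner(board) == 0 and 0 in board:
        move = rng.choice([i for i, v in enumerate(board) if v == 0])
        board[move] = player
        positions.append(tuple(board))
        player = -player
    return positions, winner(board)

def generate_dataset(n_games, seed=0):
    """Label every visited position with the final result of its game."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        positions, outcome = simulate_game(rng)
        data.extend((pos, outcome) for pos in positions)
    return data

dataset = generate_dataset(20)
print(len(dataset))
```

Every label here is correct by construction because the rules decide the outcome. For natural language or most real-world data there is no equivalent of `winner()` to call.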

I believe the next large step will be multi-modal, where text is contextualized by video so the LLM will be able to concretize what "sitting on a chair" actually means with a single example, without needing to see thousands of textual associations to infer the meaning from the text.