by Buttons840
1158 days ago
> AI can generate as much synthetic data as we need, on demand.
I don't think this is right. Can I take an untrained LLM (a neural network with random parameters), and have it start generating garbage, and then train the network to produce more of the same and then have it bootstrap itself to intelligence? Of course not. What if I train it just a little bit first? What if I train it until it produces gibberish, but does occasionally string two words together that are spelled correctly. Can I have it produce petabytes of gibberish and then train on that to reach GPT-4's level? You seem to argue that at some point, the AI is able to improve by training on its own output. At what point does that arrive? Because so far we've never seen an AI improve based on its own output. (As far as I know?)
Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.
Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly, it does improve the performance of the original model, up to a certain limit (far from the singularity).
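To make the pseudo-labeling idea concrete, here is a minimal sketch on a toy 1D problem. It is not taken from [1] or [2]: the nearest-centroid classifier, the data, the `min_margin` cutoff, and all function names are assumptions picked for illustration. The model labels the unlabeled points it is confident about, adds them to its own training set, and refits.

```python
# Toy pseudo-labeling (self-training) sketch. Everything here is hypothetical:
# the nearest-centroid model, the data, and the confidence cutoff are
# illustrative assumptions, not the method from the linked references.

def fit_centroids(points):
    """Return the per-class mean of (x, label) pairs with labels 0 and 1."""
    sums, counts = [0.0, 0.0], [0, 0]
    for x, label in points:
        sums[label] += x
        counts[label] += 1
    return sums[0] / counts[0], sums[1] / counts[1]

def predict(centroids, x):
    """Assign x to the class of the nearest centroid."""
    c0, c1 = centroids
    return 0 if abs(x - c0) <= abs(x - c1) else 1

def margin(centroids, x):
    """Distance from the decision boundary, used as a crude confidence score."""
    c0, c1 = centroids
    return abs(x - (c0 + c1) / 2)

def accuracy(centroids, test_set):
    hits = sum(predict(centroids, x) == y for x, y in test_set)
    return hits / len(test_set)

def pseudo_label(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Repeatedly add confidently predicted points to the training set and refit."""
    train, pool = list(labeled), list(unlabeled)
    centroids = fit_centroids(train)
    for _ in range(max_rounds):
        confident = [x for x in pool if margin(centroids, x) >= min_margin]
        if not confident:
            break  # nothing left that the model is sure about
        train += [(x, predict(centroids, x)) for x in confident]
        pool = [x for x in pool if x not in confident]
        centroids = fit_centroids(train)
    return centroids

# Two labeled points that happen to sit off-center, plus unlabeled points
# drawn from two groups around -2 and +2.
LABELED = [(-0.5, 0), (3.0, 1)]
UNLABELED = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
TEST = [(-1.2, 0), (-0.4, 0), (0.5, 1), (0.9, 1), (1.8, 1)]

baseline_acc = accuracy(fit_centroids(LABELED), TEST)
final_acc = accuracy(pseudo_label(LABELED, UNLABELED), TEST)
print(baseline_acc, final_acc)
```

On this hand-built data the decision boundary drifts toward the gap between the two groups, and test accuracy goes from 3/5 to 5/5. The improvement comes from the structure of the unlabeled data, not from the model inventing information, which is why it plateaus.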
Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].
Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, a specific type of boosting (gradient boosting over decision trees) often yields state-of-the-art performance on tabular data.
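For the boosting point, a rough sketch may help. Note that this uses AdaBoost with one-split decision stumps, not the gradient-boosted trees that win on tabular data; I picked it only because it fits in a few lines. The data and every function name are assumptions for illustration. Each round looks at where the current ensemble's predictions are wrong, upweights those examples, and fits the next weak model to them.

```python
# Minimal AdaBoost sketch with decision stumps on 1D data.
# Hypothetical names and toy data; a real project would use a library.
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` left of the threshold, `-polarity` right of it."""
    threshold, polarity = stump
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Exhaustively pick the threshold/polarity with the lowest weighted error."""
    points = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best, best_err = None, float("inf")
    for threshold in thresholds:
        for polarity in (1, -1):
            stump = (threshold, polarity)
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Use the current stump's output to upweight the examples it got wrong.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    hits = sum(ensemble_predict(ensemble, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Labels flip three times, so no single threshold can separate them.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_round_acc = accuracy(adaboost(XS, YS, rounds=1), XS, YS)
three_round_acc = accuracy(adaboost(XS, YS, rounds=3), XS, YS)
print(one_round_acc, three_round_acc)
```

On this toy set a single stump gets 7 of 10 training points right, while three boosted stumps fit all 10. Unlike pseudo-labeling, boosting still relies on the true labels at every round; what it reuses is the previous model's errors.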
[1] https://arxiv.org/abs/2101.06329
[2] https://stats.stackexchange.com/questions/364584/why-does-us...
[3] https://arxiv.org/abs/2304.08466
[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)