| HN Mirror


by daydream 1184 days ago

> LLMs will generate data that already fits the model. You can't generate information out of thin air. But you can use a larger model or one with more training data to generate inputs for another model.

I suppose you could use an LLM that's too large and slow for production to generate training data. But even that seems dangerous, especially since LLM-generated data will almost certainly creep into the training set as time goes on, despite likely efforts to keep it out.
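
To make the quoted idea concrete, here's a toy sketch of a big "teacher" generating training data for a smaller "student". This is not an LLM pipeline: `teacher`, `generate_synthetic_data`, and `fit_student` are hypothetical names I made up, and the teacher is just a fixed linear function standing in for the large model. The point it illustrates is the one from the quote: the student can only recover what the teacher already encodes.

```python
import random
from typing import List, Tuple

def teacher(x: float) -> float:
    """Hypothetical stand-in for the large, slow model: a fixed function."""
    return 3.0 * x + 1.0

def generate_synthetic_data(n: int, seed: int = 0) -> List[Tuple[float, float]]:
    """Sample inputs and label them with the teacher's outputs."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, teacher(x)) for x in xs]

def fit_student(data: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a*x + b (the small 'student' model)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# The student is trained purely on teacher-generated data.
data = generate_synthetic_data(200)
a, b = fit_student(data)
print(f"student learned a={a:.3f}, b={b:.3f}")
```

The student ends up matching the teacher's slope and intercept and nothing more. Any mistake baked into `teacher` would be copied just as faithfully, which is the part that seems dangerous to me.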

Is generated data commonly used in LLM training?