
Model collapse is basically a myth and something of a joke in the ML community. The assumptions of the model collapse paper do not hold in the real world, even when training on uncurated generated data. In fact, LLMs of equal size trained on newer web scrapes, which include generated data, have enhanced capabilities.

But in practice training data is curated, and curated synthetic training data is even better than human data. State-of-the-art LLMs like Phi-2 or the recent GPT-4 killer Claude 3 are trained entirely or mostly on generated data.