| HN Mirror



	by astrange 725 days ago
	Model collapse was basically a coping idea made up by artists who were hoping AI image generators would all magically destroy themselves at some point; I don't think it was ever considered likely to happen. It does seem to be true that clean data works better than low-quality data.

1 comments

groby_b 725 days ago

You're confusing it with data poisoning.

Model collapse itself is (was?) a fairly serious research topic: https://arxiv.org/abs/2305.17493

By now we've reached "probably not inevitable": https://arxiv.org/abs/2404.01413 argues there's a finite upper bound on the error, but I'd also point out that the paper assumes the training set grows with each training generation and is strictly accumulative.

To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.

But it's probably fair to say the current SOTA is still more or less "it's neither impossible nor inevitable".
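
To make the accumulate-versus-replace distinction concrete, here is a minimal toy sketch. It is not the experimental setup of either paper: the one-dimensional Gaussian "model", the function names (`fit_gaussian`, `simulate`), and the parameters (10 samples per generation, 500 generations) are all assumptions chosen for illustration. Each generation fits a Gaussian to its training set and samples from the fit; in "replace" mode the next generation sees only those samples, while in "accumulate" mode it sees the original data plus everything generated so far.

```python
import math
import random


def fit_gaussian(samples):
    """Return (mean, sample standard deviation) of a list of floats."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(var)


def simulate(mode, generations=500, n=10, seed=0):
    """Repeatedly fit a Gaussian, sample from the fit, and refit.

    mode="replace":    each generation trains only on the previous
                       generation's synthetic samples.
    mode="accumulate": each generation trains on the real data plus
                       every synthetic sample produced so far.
    Returns the fitted standard deviation at each generation.
    """
    if mode not in ("replace", "accumulate"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    # Stand-in for the archived "real" dataset (hypothetical toy data).
    train = [rng.gauss(0.0, 1.0) for _ in range(n)]
    stds = []
    for _ in range(generations + 1):
        mean, std = fit_gaussian(train)
        stds.append(std)
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        if mode == "replace":
            train = synthetic          # old data is discarded
        else:
            train.extend(synthetic)    # old data is kept
    return stds


if __name__ == "__main__":
    for mode in ("replace", "accumulate"):
        stds = simulate(mode)
        print(mode, "first std:", round(stds[0], 3), "last std:", stds[-1])
```

In this toy, the fitted spread in "replace" mode tends to shrink toward zero over many generations, while in "accumulate" mode it stays in the neighborhood of the original estimate. That only illustrates the qualitative claim; it says nothing about how large real models behave.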


astrange 724 days ago

Oh, no, they definitely believe both are going to happen and ChatGPT is just going to stop working because it'll see itself on the internet. It goes with the common belief that LLMs learn from what you type into them.

> To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.

I think that will always be available, or at least, a dataset with the distribution you want will be available.


groby_b 724 days ago

Don't know why you have such disdain for artists, but either way, the original point was that model collapse wasn't "a coping idea made up by artists", but a valid, research-backed scientific model.

> I think that [clean pre-2022 data set] will always be available

Good luck obtaining one.
