

by groby_b 719 days ago

You're confusing it with data poisoning.

Model collapse itself is (was?) a fairly serious research topic: https://arxiv.org/abs/2305.17493

By now we've reached "probably not inevitable": https://arxiv.org/abs/2404.01413 argues there's a finite upper bound on the error. But I'd also point out that the paper assumes the training set grows with every training generation and is strictly accumulative, i.e. each generation's synthetic data is added to everything that came before rather than replacing it.

To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.

But it's probably fair to say current SOTA is still more or less "it's neither impossible nor inevitable".
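To make the replace-vs-accumulate distinction concrete, here's a toy sketch. It is not the setup from either paper: the "model" is just a 1-D Gaussian fit, and `fit_gaussian`, `simulate`, and all the parameter values are made up for illustration. Each generation fits the current training set, samples synthetic points from the fit, and then either replaces the training set with them or appends them to it.

```python
import math
import random


def fit_gaussian(data):
    """Return (mean, sample std); a stand-in for 'training a model'."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    return mean, math.sqrt(var)


def simulate(accumulate, n=10, generations=500, seed=0):
    """Return the fitted std after repeatedly training on model output.

    accumulate=False: each generation trains only on the previous
                      generation's synthetic samples.
    accumulate=True:  synthetic samples are appended to all earlier data,
                      so the original real samples are never discarded.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from N(0, 1).
    train = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        mean, std = fit_gaussian(train)
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        train = train + synthetic if accumulate else synthetic
    return fit_gaussian(train)[1]


if __name__ == "__main__":
    print("replace:   ", simulate(accumulate=False))
    print("accumulate:", simulate(accumulate=True))
```

With small `n` and many generations, the replace run should end with a fitted std many orders of magnitude below the original, while the accumulate run should stay in the neighborhood of where it started. That's only the flavor of the argument in a deliberately tiny setting, not evidence about real LLM training.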


astrange 719 days ago

Oh, no, they definitely believe both are going to happen and ChatGPT is just going to stop working because it'll see itself on the internet. It goes with the common belief that LLMs learn from what you type into them.

> To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.

I think that will always be available, or at least, a dataset with the distribution you want will be available.


groby_b 718 days ago

Don't know why you have such disdain for artists, but either way, the original point was that model collapse wasn't "a coping idea made up by artists", but a valid, research-backed scientific model.

> I think that [clean pre-2022 dataset] will always be available

Good luck obtaining one.
