Hacker News new | ask | show | jobs
by spiffytech 974 days ago
> This is a very academic scenario.

Is it going to remain academic? I can easily imagine the spammy content farm / listicle business model evolving to be fully automated, creating an input loop.

1 comments

Sure, there will be some pollution. It's very multivariate and depends on factors like content split, generation quality, and novel information. A scenario in which all of your data is generated by the previous model and you run n training loops is academic.

It's also worth noting that when OpenAI created Whisper, they had to heuristically remove many transcripts from poor ASR systems, and they definitely didn't catch them all.