Hacker News new | ask | show | jobs
by svc0 974 days ago
Sure, there will be some pollution. It's very multivariate and depends on factors like content split, generation quality, and novel information. A scenario in which all of your data is generated by the previous model and you run n training loops is academic.

It's also worth noting that when OpenAI created Whisper, they had to heuristically remove many transcripts from poor ASR systems, and they definitely didn't catch them all.