Didn't Anthropic's case already set the precedent that training itself is fine? It's not like copyrighted novels are a large portion of human-generated text data. It's just the stuff that's easier to get because it's preserved in bulk.

Video transcription has more or less been solved. Imagine how much data Google has in YouTube transcripts. And the longer these AI chatbots operate, the more data they collect for training as well (I think Google letting you easily upvote or downvote a bot's response is a good idea).