Hacker News new | ask | show | jobs
by visarga 1859 days ago
> But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Easy fix - keep a bloom filter of hashed ngrams ensuring you don't repeat more than N words from the training set.