by nightpool
946 days ago
Yes, training data should only ever contain public or non-sensitive data. This is well known, and it's why ChatGPT, Bard, etc. are designed the way they are. It's also why a generalizable model that you can "prompt" with different user-specific context is important.

Consider the case of public GitHub repositories. There are millions of them, but each one could become private at any time. As soon as a repository is private, it shouldn't appear in search results (to continue the Elasticsearch indexing analogy), and presumably it also shouldn't influence model output (especially if the model can be prompted to dump its raw inputs). When a repository owner changes their public repository to private, how do you expunge that repository from the training data? You could ensure it's never in the training data in the first place, but then how do you know which repositories will remain public forever? You could try to defer filtering until prompt time, but you can't prompt a model with the embeddings of every public repository on GitHub, can you?
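
To make the search half of the analogy concrete, here's a rough sketch of what query-time filtering looks like in an index. It's a toy in-memory stand-in, not Elasticsearch or GitHub's actual search, and `Repo`, `RepoIndex`, and the repository names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    name: str
    text: str
    public: bool = True


@dataclass
class RepoIndex:
    """Toy stand-in for a search index; every name here is hypothetical."""

    repos: dict[str, Repo] = field(default_factory=dict)

    def add(self, repo: Repo) -> None:
        self.repos[repo.name] = repo

    def set_public(self, name: str, public: bool) -> None:
        # Flipping visibility is a metadata update, not a re-index.
        self.repos[name].public = public

    def search(self, term: str) -> list[str]:
        # Visibility is checked at query time, so a change takes effect
        # on the very next search.
        return sorted(
            r.name
            for r in self.repos.values()
            if r.public and term.lower() in r.text.lower()
        )


index = RepoIndex()
index.add(Repo("alice/parser", "a fast JSON parser"))
index.add(Repo("bob/secrets", "JSON config with API keys"))

before = index.search("json")
index.set_public("bob/secrets", False)  # owner makes the repo private
after = index.search("json")
```

The index can do this because each document stays a separate record with its own visibility flag. Trained weights have no per-repository record to flip, which is the problem.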