by nightpool 946 days ago
Yes, training data should only ever contain public or non-sensitive data. This is well known, and it's why ChatGPT, Bard, etc. are designed the way they are. It's also why the ability to have a generalizable model that you can "prompt" with different user-specific context is important.
1 comments

Are you going to re-prompt the model with the (possibly very large) context that is available to the user every time they make a query? You'll need to enumerate every resource the user can access and include them all in the prompt.

Consider the case of public GitHub repositories. There are millions of them, but each one could become private at any time. As soon as it's private, it shouldn't appear in search results (to continue the ElasticSearch indexing analogy), and presumably it also shouldn't influence model output (especially if the model can be prompted to dump its raw inputs). When a repository owner changes their public repository to private, how do you expunge that repository from the training data? You could ensure it's never in the training data in the first place, but then how do you know which repositories will remain public forever? You could try to defer filtering until prompt time, but you can't prompt a model with the embeddings of every public repository on GitHub, can you?

You can first search your context for the items related to the query, and only then include those items in the prompt. Look into retrieval-augmented generation.
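
To make that concrete, here is a rough sketch of the idea, not anyone's production design. It assumes a toy bag-of-words similarity in place of real embeddings, and the names `Doc`, `retrieve`, `build_prompt`, and `allowed_private_ids` are all made up for illustration. The point is that visibility is checked at query time, so a repository that goes private simply stops being retrievable, and nothing has to be expunged from the model itself.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str
    is_public: bool  # current visibility, read at query time (hypothetical field)


def tokens(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, allowed_private_ids, k=2):
    """Return up to k documents the user may see, most similar first."""
    q = tokens(query)
    # Access filter runs first, against the current visibility flags.
    visible = [d for d in docs if d.is_public or d.doc_id in allowed_private_ids]
    scored = [(cosine(q, tokens(d.text)), d) for d in visible]
    # Drop documents with no overlap at all, then rank by score.
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(query, retrieved):
    """Put only the retrieved documents into the prompt, not the whole corpus."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The prompt only ever holds the top `k` matches, so its size doesn't grow with the number of resources the user can access. A real system would swap the word counts for an embedding index and would need the index's visibility flags kept in sync with the source of truth, which this sketch takes for granted.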