| HN Mirror

Are you going to re-prompt the model with the (possibly very large) context that is available to the user every time they make a query? You'll need to enumerate every resource the user can access and include them all in the prompt.

Consider the case of public GitHub repositories. There are millions of them, but each one could become private at any time. As soon as it's private, then it shouldn't appear in search results (to continue the ElasticSearch indexing analogy), and presumably it also shouldn't influence model output (especially if the model can be prompted to dump its raw inputs). When a repository owner changes their public repository to be private, how do you expunge that repository from the training data? You could ensure it's never in the training data in the first place, but then how do you know which repositories will remain public forever? You could try to avoid filtering until prompt time, but you can't prompt a model with the embeddings of every public repository on GitHub, can you?