Hacker News new | ask | show | jobs
by chatmasta 946 days ago
The problem is that for granular access control, that implies you need to train a separate model for each user, such that the model weights only include training data that is accessible to that user. And when the user is granted or removed access to a resource, the model needs to stay in sync.

This is hard enough when maintaining an ElasticSearch instance and keeping it in sync with the main database. Doing it with an LLM sounds like even more of a nightmare.

3 comments

Training data should only ever contain public or non-sensitive data, yes, this is well-known and why ChatGPT, Bard, etc are designed the way they are. That's why the ability to have a generalizable model that you can "prompt" with different user-specific context is important.
Are you going to re-prompt the model with the (possibly very large) context that is available to the user every time they make a query? You'll need to enumerate every resource the user can access and include them all in the prompt.

Consider the case of public GitHub repositories. There are millions of them, but each one could become private at any time. As soon as it's private, then it shouldn't appear in search results (to continue the ElasticSearch indexing analogy), and presumably it also shouldn't influence model output (especially if the model can be prompted to dump its raw inputs). When a repository owner changes their public repository to be private, how do you expunge that repository from the training data? You could ensure it's never in the training data in the first place, but then how do you know which repositories will remain public forever? You could try to avoid filtering until prompt time, but you can't prompt a model with the embeddings of every public repository on GitHub, can you?

You can first search in your context for related things and only then prompt them. Look into retrieval-augmented generation.
The reason HAL went nuts (given in 2010) is that they asked him to compartmentalize his data, but still be as helpful as possible:

> Dr. Chandra discovers that HAL's crisis was caused by a programming contradiction: he was constructed for "the accurate processing of information without distortion or concealment", yet his orders, directly from Dr. Heywood Floyd at the National Council on Astronautics, required him to keep the discovery of the Monolith TMA-1 a secret for reasons of national security. -- Wikipedia.

Just saying.

Not at all. Sensitive data should be given only as context during inferencew and only to users who are allowed to read such data.