I wonder what implications this has on distributing open source models and then letting people fine tune it. Could you theoretically slip in a "backdoor" that lets you then get certain outputs back?
You could fine-tune a model that if the user would ask it to generate code and certain conditions are met, then it would generate code that includes a backdoor which does something malicious. However, in the current deployment scenarios, the model would still have to rely on the victim to not notice the backdoor and execute the malicious code - but perhaps you could choose the conditions to trigger the backdoor generation only when it's quite likely to trick the victim.
(I'm assuming that the actual code running the model is clean, because if it's not, then you don't need to involve ML models at all)
edit: or do some fancy MITM thing on wherever you host the data. some random person on the interwebs? give them clean data. our GPU training servers? modify these specific training examples during the download.
edit2: in case it's not clear from ^ ... it depends on the threat model. "can it be done in this specific scenario". my initial comment's threat model has code is public, data is not. second threat model has code + data are public, but training servers are not.
model reverse engineering is a pretty cool research area, and one big area of it is figuring out the training sets :) this has been useful for detecting when modelers include benchmark eval sets in their training data (!), but can also be used to inform data poisoning attacks
(I'm assuming that the actual code running the model is clean, because if it's not, then you don't need to involve ML models at all)