


	by julianh65 897 days ago
	I wonder what implications this has for distributing open source models and then letting people fine-tune them. Could you theoretically slip in a "backdoor" that then lets you get certain outputs back?

2 comments

PeterisP 897 days ago

You could fine-tune a model so that, if the user asks it to generate code and certain conditions are met, it generates code that includes a backdoor which does something malicious. However, in current deployment scenarios, the attack would still rely on the victim not noticing the backdoor and executing the malicious code - but perhaps you could choose the conditions so that the backdoor is generated only when it's quite likely to trick the victim.

(I'm assuming that the actual code running the model is clean, because if it's not, then you don't need to involve ML models at all)


dijksterhuis 897 days ago

Sure. The trick is to never let your datasets be public. Then no-one can ever work out exactly what the model was trained on.

https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobot...

edit: or do some fancy MITM thing on wherever you host the data. some random person on the interwebs? give them clean data. our GPU training servers? modify these specific training examples during the download.

edit2: in case it's not clear from ^ ... it depends on the threat model, i.e. "can it be done in this specific scenario". my initial comment's threat model has the code public and the data private. the second threat model has code + data public, but the training servers private.
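
A defensive sketch for the second threat model (the host serves clean data to outsiders and modified data to the training servers): hash what each party actually received and compare the digests over a channel the host does not control. This is only an illustration. The function names `sha256_digest` and `same_data` and the sample rows are hypothetical, and it assumes both parties can exchange digests out of band.

```python
import hashlib


def sha256_digest(chunks):
    """Return the hex SHA-256 digest of an iterable of byte chunks."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def same_data(chunks_a, chunks_b):
    """True if two downloads are byte-for-byte identical (by SHA-256)."""
    return sha256_digest(chunks_a) == sha256_digest(chunks_b)


# Hypothetical downloads of the "same" dataset from two vantage points.
seen_by_outsider = [b'{"text": "example row 1"}\n', b'{"text": "example row 2"}\n']
seen_by_trainer = [b'{"text": "example row 1"}\n', b'{"text": "modified row 2"}\n']

print(same_data(seen_by_outsider, seen_by_outsider))
print(same_data(seen_by_outsider, seen_by_trainer))
```

A matching digest only shows that both parties got the same bytes. It says nothing about whether those bytes were clean in the first place, which is the first threat model.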


lmeyerov 897 days ago

model reverse engineering is a pretty cool research area, and one big area of it is figuring out the training sets :) this has been useful for detecting when modelers include benchmark eval sets in their training data (!), but can also be used to inform data poisoning attacks
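
For the benchmark contamination case, a much simpler data-side check is possible when you do have the training corpus in hand: look for word n-grams shared between eval examples and training documents. This is not the reverse-engineering approach described above, which works from the model alone. The names `ngrams` and `find_contaminated` and the default `n=8` are assumptions made for this sketch.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(eval_examples, training_docs, n=8):
    """Return the eval examples that share at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [ex for ex in eval_examples if ngrams(ex, n) & train_grams]


# Hypothetical toy data; a small n is used because the strings are short.
training_docs = ["the quick brown fox jumps over the lazy dog"]
eval_examples = ["quick brown fox jumps", "a completely different question"]

print(find_contaminated(eval_examples, training_docs, n=3))
```

Exact n-gram matching misses paraphrased or reformatted benchmark items, so an empty result is weak evidence of a clean training set.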


dijksterhuis 897 days ago

> modelers include benchmark eval sets in their training data

:sighs: some things never change do they.
