Hacker News new | ask | show | jobs
by anonymousDan 1003 days ago
For me it's also interesting as a potential pathway for data poisoning attacks - if you have control over the data used to train a production model, can you modify the dataset such that it inserts a backdoor to any model trained subsequently trained over it? E.g. what if gpt was biased to insert certain security vulnerabilities as part of its codegen capabilities?
3 comments

The AI version of https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...?

At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"

That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.

AI is hard.

It’s risky to make definitive claims about what is or isn’t a possible security vector, but based on my years of training GPTs, you’d find it very difficult for a number of reasons.

Firstly, the malicious data needs to form a significant portion of the data. Given that training data is on the order of terabytes, this alone makes it unlikely you’ll be able to poison the dataset.

Unless the entire training dataset was also stored in this 38TB, you’ll only be able to fine tune the model, and fine tuning tends to destroy model quality (or else fine tuning would be the default case for foundation models — you’d train it, fine tune it to make it “even better” somehow, then release it. But we don’t, because it makes the model less general by definition).

GPT is able to accidentally spit out exact bits of text from training input, such as a particular square root function.

What fraction of the training data needed to be that text?

If the question is "Would it be possible to get GPT to try to add backdoors to code examples by poisoning the training data?" my answer would be no. The sheer quantity of training data means that even with GPT-4's assistance in generating code examples that match the format of the original training data, you wouldn't be able to inject enough poison to change the model's behavior by much.

Remember, once the model is trained, it's verified in a number of ways, ultimately based on human prompting. If the tokens that come out of an experimental model are obviously bad (because, say, the model is suggesting exploits instead of helpful code), all that will do is get a scientist to look more deeply into why the model is behaving the way it is. And then that would lead to discovering the poisoned data.

The payoff for an attacker is whether they can achieve some sort of goal. You'd have to clearly define what that goal is in order to know how effective the poisoning attack could be. What's the end game?

I don't disagree with you on targeted attacks, but if you're creating output at scale then I'd say there's marginally more risk.

It's possible there's some minimum amount of poisoned data (a % or log function of a given dataset size n) that would then translate to generating a vulnerable output in x% of total outputs. If x is low enough to get past fine tuning/regression testing but high enough to still occur within the deployment space, then you've effectively created a new category of supply-chain attack.

There's probably more research that needs to be done into occurrence rate of poisoned data showing up in final output, and that result is likely specific to the AI model and/or version.

As I commented elsewhere, GPT is such a target rich security environment that it is hard to know why you would bother with this. On the other hand, advanced persistent attackers (eg the NSA) have a pretty good imagination. I could see them having both motive and means to go out of their way to achieve a particular result.

On human checks, http://www.underhanded-c.org/ demonstrates that it would be possible to inject content that will pass that.

Makes me wonder if there would be a way to pollute imagenet so a particular image would always match for something like a facial recognition access control system or the like. Maybe adversarial data that would hide particular traffic patterns from an AI enabled IDS would be more plausible and something the NSA might be interested in.
In theory for any AI model that generates code you'll want to have a series of post generation tests, for example something like SAST and/or SCA that ensure the model is not biasing itself to particular flaws.

At least for common languages this should stand out.

Where it gets more tricky is watering hole attacks against specialized languages or certain setups. This said you'd have to ensure that this data is not already there scraped up from the internet.