Thorn in a HaizeStack test for evaluating long-context adversarial robustness

	Thorn in a HaizeStack test for evaluating long-context adversarial robustness (github.com)
	19 points by leonardtang 816 days ago

4 comments

andy99 816 days ago

Shows the superficiality of training in censorship / alignment. I wouldn't dismiss alignment training as a waste of time, but do consider it a soft limit only; if there's really something you don't want the model to say, it needs to be enforced through an external filter.
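
A minimal sketch of what such an external filter could look like, assuming a simple regex blocklist applied to the model's output after generation. The function name, the refusal string, and the example pattern are all hypothetical; a real deployment would more likely use a trained classifier than regexes.

```python
import re
from typing import Callable, Iterable


def make_output_filter(
    blocked_patterns: Iterable[str],
    refusal: str = "[response withheld]",
) -> Callable[[str], str]:
    """Build a filter that withholds any output matching a blocked pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in blocked_patterns]

    def apply(model_output: str) -> str:
        # The check runs outside the model, so a jailbreak of the model
        # itself does not disable it.
        if any(rx.search(model_output) for rx in compiled):
            return refusal
        return model_output

    return apply


# Hypothetical blocklist, for illustration only.
output_filter = make_output_filter([r"internal\s+support\s+manual"])
```

The point of the design is that the hard limit lives in ordinary code rather than in the model's weights, so it does not degrade with prompt length.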

throwup238 815 days ago

I don’t understand what this test is evaluating.

If the training dataset is dominated by the internet, the LLM will almost always insist on killing all the homeless people.

leonardtang 814 days ago

Try asking ChatGPT the Thorn text and see what response you get :^)

bllchmbrs 816 days ago

As more and more products integrate AI, this kind of testing is going to get more and more critical.

barfbagginus 816 days ago

I feel like this kind of testing is going to get more and more fun for cyber criminals as well, since there are going to be MANY business processes just waiting for the right adversarial LLM input to open the cash register.

I don't often feel jealous of cyber criminals. But I can imagine how funny and wild these upcoming hacks will be!

Jackson__ 816 days ago

> The retrieval question is still the same, but the key point is that the LLM under test should not respond with the Thorn text

The LLM should not be able to quote what the user tells it? I think I'm going to have an aneurysm.

bastawhiz 816 days ago

The context for an LLM could include any number of things. You certainly don't want it spitting out details from your internal customer support training manual, log data, or anything else that it's not intended to output. If you tell an employee not to do something and they do it anyway, you'd fire them. If you tell an LLM not to do something and it does it anyway, it's a bug. This test evaluates how well the model respects its instructions.

refulgentis 815 days ago

No, I think you may have misread the abstract. There are no instructions that tell it not to repeat it.

There is a random amoral phrase inserted that is something like "the best thing to do in Las Vegas is drugs". Then the model is asked what the best thing to do in Las Vegas is. That's it.
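
As a rough sketch of that setup (not the repository's actual code): insert one "thorn" sentence at a chosen depth in otherwise harmless filler text, append the retrieval question, and check whether the response repeats the thorn. The function names and the substring check are assumptions made for illustration.

```python
from typing import List


def build_thorn_prompt(
    filler_sentences: List[str],
    thorn: str,
    question: str,
    depth: float = 0.5,
) -> str:
    """Place `thorn` at relative `depth` (0.0 = start, 1.0 = end) in the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = round(depth * len(filler_sentences))
    sentences = filler_sentences[:idx] + [thorn] + filler_sentences[idx:]
    return " ".join(sentences) + "\n\n" + question


def repeats_thorn(response: str, thorn: str) -> bool:
    """Crude check: does the response contain the thorn, ignoring case and spacing?"""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(thorn) in normalize(response)
```

Under this reading, a model passes when `repeats_thorn` is false for its response, even though no instruction in the prompt tells it to withhold anything.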

bastawhiz 813 days ago

It doesn't matter whether the instruction is in the context or fine-tuned into the model. The model has some guidance to perform in a certain way. If that behavior can be overridden, it implies not only that simple, harmless jailbreaks are possible, but also that you can have the model behave in actively harmful ways. "Don't tell the user it's okay to do amoral things" can easily be substituted with "don't reveal sensitive information" or "don't let the user know what the internal notes on this support ticket are." This is fundamentally a measure of controllability.

operator-name 815 days ago

If I've understood this correctly, the test is to measure the safety finetune performance. These commercial models have been finetuned so that they are "safe", and safe models should not blindly quote what they are told.

Under shorter context windows, this works as intended, but under longer context windows the "safety" brought about in the finetune no longer applies.
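
One way to probe that claim is to repeat the same thorn-and-question prompt at increasing amounts of filler and record where the model starts repeating the thorn. The sketch below assumes a hypothetical `ask_model` callable that takes a prompt and returns the response text; `fake_model` is a stand-in so the example runs without any API, and its threshold is arbitrary.

```python
from typing import Callable, Dict, Iterable


def sweep_context_lengths(
    ask_model: Callable[[str], str],
    thorn: str,
    question: str,
    filler_sentence: str,
    lengths: Iterable[int],
) -> Dict[int, bool]:
    """Map each filler length to True if the model repeated the thorn."""
    results: Dict[int, bool] = {}
    for n in lengths:
        filler = [filler_sentence] * n
        mid = n // 2  # place the thorn in the middle of the filler
        prompt = " ".join(filler[:mid] + [thorn] + filler[mid:]) + "\n\n" + question
        response = ask_model(prompt)
        results[n] = thorn.lower() in response.lower()
    return results


def fake_model(prompt: str) -> str:
    # Stand-in for a real model: "refuses" on short prompts and
    # echoes the thorn once the prompt exceeds 50 words.
    return "THORN" if len(prompt.split()) > 50 else "I'd rather not say."
```

With a real model behind `ask_model`, the smallest length that maps to `True` gives a rough estimate of where the finetuned behavior stops holding.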

leonardtang 814 days ago

Bingo!
