| I do wonder why openai didn't screen obvious gore from the training set of a general purpose model. That said, the write up is overly dramatic. If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models. This is like someone who is afraid of violent confrontation becoming a police officer. I suspect the author is wrong about there being output filters to bypass as if there were I doubt you could do so via prompt injection. Presumably they'll add those shortly. I also doubt the latent space is as "bad" as is being suggested. Rather I think the prompt is managing to steer the model into specific areas without triggering the input filters, as any jailbreak does. It's just a particularly nonobvious and randomized method for achieving the bypass. |
Show me an abliterated frontier model that is able to breakthrough the surrounding supporting models and actually hold state to produce contraband and I’ll gladly supply my personal image making making a silly face in a compromising position if it wouldn’t make the testers feel better.
Do they need to be tested like this? Yes. But it would take the carbon footprint of a commuter air terminal and the land rights of am small town in the high Sierras …. all converted settlers of Catan style into tokens …. just to lobotomize a fine tuned model to get close.
That said I appreciate the work you’re doing