Hacker News new | ask | show | jobs
by krackers 6 hours ago
>Is there a similar trick to poison an LLMs weights during training?

Yes, all those "jailbreak prompts" are part of the training set, so this can happen: https://ttps.ai/procedure/x_bot_exposing_itself_after_traini...

Used to be that merely mentioning "Pliny the Liberator" was enough to "jailbreak" an LLM. It doesn't work these days though, I guess labs have updated their RL methods to neutralize it.