
And the predominant mode of thought at OpenAI is that alignment can be achieved through RL, but we also know that this doesn’t actually work, because you can still jailbreak the model. Yet they are still trying to build ever stronger eggshells. However much you RLHF the model to pretend to be nice, it still has all of the flawed human characteristics you mention on the inside.

RLHF is almost the worst thing you can do to a model if your goal is safety. Better to have a model that looks evil if it has evil inside than a model that looks nice and friendly but still has the capability for evil underneath the surface. I’ve met people like the latter, and they are the most dangerous kind of people.