Hacker News
by dzink 1186 days ago
The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially.
4 comments

I'm not sure anything can keep up. Nearly unlimited utility also means a nearly unlimited attack surface, both for exploits against the model itself and for using it to attack external systems.

We have unknown emergent behavior, the inner workings are a black box, and the input is anything that can be described in human language.

Containing nefarious uses will be an impossible task. Additionally, protecting against humans is supposed to be the easy part, which doesn't bode well for AGI/ASI.

Seems like refusing to answer is for PR and usability purposes, not safety. They want people to learn what the tool is supposed to be good for, both from using the tool directly and by sharing examples.

If some of the examples are about how to troll it and it’s obvious that it’s being trolled, well, you can do that, but they won’t get mistaken for things the tool is actually supposed to be good for, so nobody is confused.

But who watches the policing model?
isn't that pretty much what they are doing anyway?

my understanding was that RLHF basically uses human feedback to train a reward model, which is then used to further fine-tune the original model's outputs. I could have misunderstood tho.

https://huggingface.co/blog/rlhf#reward-model-training
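
For anyone who wants to see the reward-model step concretely, here is a minimal sketch, assuming a linear reward over made-up two-number feature vectors and a pairwise (Bradley-Terry style) preference loss. The names `reward`, `pairwise_loss`, `train_reward_model` and the toy `PAIRS` data are hypothetical and only for illustration; real reward models are neural networks scoring text, and the later step where the original model is fine-tuned against the learned reward is left out entirely.

```python
import math

def reward(weights, features):
    """Score a response; here just a dot product over toy features."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen response scores higher."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Plain gradient descent on the pairwise loss over human preference pairs."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * (chosen - rejected),
            # so stepping against it moves weights toward (chosen - rejected).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical data: each pair is (features of the response a human preferred,
# features of the response they rejected).
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.7, 0.1], [0.3, 0.9]),
]

weights = train_reward_model(PAIRS, dim=2)
for chosen, rejected in PAIRS:
    print(reward(weights, chosen) > reward(weights, rejected))
```

After training, the learned reward ranks each preferred response above its rejected counterpart. In that sense the reward model is a stand-in for the human raters, which is roughly the "model watching the model" arrangement discussed above.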