| HN Mirror



	by dzink 1186 days ago
	The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially.

4 comments

13years 1186 days ago

I'm not sure anything can keep up. Having nearly unlimited utility also means that it has a nearly unlimited attack surface, both for exploits against itself and for being used to attack external systems.

We have unknown emergent behavior, the inner workings are a black box, and the input is anything that can be described in human language.

Containing nefarious uses will be an impossible task. Additionally, protecting against humans is supposed to be the easy part, which doesn't bode well for AGI/ASI.


skybrian 1186 days ago

Seems like refusing to answer is for PR and usability purposes, not safety. They want people to learn what the tool is supposed to be good for, both from using the tool directly and by sharing examples.

If some of the examples are about how to troll it and it’s obvious that it’s being trolled, well, you can do that, but they won’t get mistaken for things the tool is actually supposed to be good for, so nobody is confused.


pixl97 1186 days ago

But who watches the policing model?


LesZedCB 1186 days ago

isn't that pretty much what they are doing anyway?

my understanding was that RLHF basically uses human feedback to train a reward model, which is then used to further train the original model's outputs. I could have misunderstood tho.

https://huggingface.co/blog/rlhf#reward-model-training
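
For anyone who wants the reward model step in concrete terms, here is a rough sketch, not what any lab actually runs. It assumes each response has already been reduced to a made-up two-number feature vector (helpfulness, toxicity) and fits a linear scorer on human "chosen vs. rejected" pairs with a pairwise logistic loss. The function names and toy data are hypothetical, and the later RL fine-tuning step is left out; the last lines just use the learned score to rank two candidate outputs.

```python
import math

def reward(weights, features):
    """Linear reward model: one scalar score for a response's feature vector."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(preferences, dim, lr=0.1, epochs=200):
    """Fit weights on (chosen, rejected) pairs by minimizing
    -log(sigmoid(reward(chosen) - reward(rejected)))."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_scale = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Toy human feedback (assumed data): features are [helpfulness, toxicity],
# and the labeler's preferred response comes first in each pair.
preferences = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.3]),
]
weights = train_reward_model(preferences, dim=2)

# Score new candidate outputs with the learned reward model.
candidates = {"helpful answer": [0.8, 0.1], "rude answer": [0.5, 0.9]}
best = max(candidates, key=lambda name: reward(weights, candidates[name]))
print(best)
```

In full RLHF the learned score would feed a policy-optimization step on the original model rather than a simple ranking, so treat this as the "human feedback trains a model" half only.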
