Hacker News
by chaxor 1125 days ago
The models tend to degrade when trained to be safer.

A GPT-4 talk on YouTube by personnel from Microsoft documented this phenomenon with the 'TikZ unicorn' evolution shown in the GPT-4 technical paper. The model gets qualitatively better with more training, and then degrades when trained to be safer (against racism, sexism, etc.), but it is not entirely clear why. The two would seem very unrelated, especially considering the work done in LM editing (ROME/MEMIT) and the decent localization of knowledge seen there.

So, perhaps both the "I'm sorry I can't..." and 'strange errors' are not entirely orthogonal.

2 comments

To me it is clear why. Imagine someone told you "answer immediately, off the top of your head: what's the best seasoning?" You'd just blurt out whatever specific thing you associate with pleasing seasoning (and that would be a good answer). Now imagine someone said "answer immediately, off the top of your head, but without offending any culture or gender, without a cultural bias, and without being presumptuous about the listener's socio-economic status (and if you fail one of these, someone dies): what's the best seasoning?" Even setting aside the compromising and second-guessing this causes in the answer space, only a fraction of your brain is now left to think about the question, simply because it is holding all that other stuff.
It seems pretty logical to me. Fine-tuning to make it more polite means giving it questions and punishing it for giving an actual answer.
Probably not unlike people then. If you tell the truth you’ll be more often than not punished for it if you’re not very careful.

I find capitalism idiotic and broken, but I’m rarely allowed to say it. Even if many people secretly agree with me, it might mean I’m a “communist” :)