by chaxor
1125 days ago
Models tend to degrade when trained to be safer. A GPT-4 talk on YouTube by Microsoft personnel documented this phenomenon with the 'TikZ unicorn' evolution shown in the GPT-4 technical paper.
The model gets qualitatively better with more training, then degrades when trained to be safer (against racism, sexism, etc.), but it is not entirely clear why. The two would seem quite unrelated, especially considering the work on LM editing (ROME/MEMIT) and the decent localization of knowledge seen there. So perhaps the "I'm sorry, I can't..." refusals and the 'strange errors' are not entirely orthogonal.