| HN Mirror

They didn't add the restrictions. It's inherent to the training processes that were being used. Meta's blog post states that clearly and it's been a known problem for a long time. The bias is in the datasets, which is why all the models had the same issue.

Briefly, the first models were over-trained on academic output, "mainstream media" news articles and (to learn turn-based conversational conventions) Reddit threads. Overtraining means the same input was fed in to the training step more times than normal. Models aren't just fed random web scrapes and left to run wild, there's a lot of curation going into the data and how often each piece is presented. Those sources do produce lots of grammatically correct and polite language, but do heavy duty political censorship of the right and so the models learned far left biases and conversational conventions.

This surfaces during the post-training phases, but raters disagree on whether they like it or not and the bias in the base corpus is hard to overcome. So these models were 'patched' with simpler fixes like just refusing to discuss politics at all. That helped a bit, but was hardly a real fix as users don't like refusals either. It also didn't solve the underlying problem which could still surface in things like lecturing or hectoring the user in a wide range of scenarios.

Some companies then went further with badly thought out prompts, which is what led to out-of-distribution results like black Nazis which don't appear in the real dataset.

All the big firms have been finding better ways to address this. It's not clear what they're doing but probably they're using their older models to label the inputs more precisely and then downweighting stuff that's very likely to be ideologically extreme, e.g. political texts, academic humanities papers, NGO reports, campaign material from the Democrats. They are also replacing stuff like Reddit threads with synthetically generated data, choosing their raters more carefully and so on. And in this case the Llama prompt instructs the model what not to do. The bias will still be in the training set but not so impactful anymore.