Hacker News
by delichon 32 days ago
Steering seems like a circumventable kludge compared to adjusting the training data directly. That is, use AI to remove the problematic content and replace it with the party line. I imagine that this is at least in progress.
4 comments

> Steering seems like a circumventable kludge compared to adjusting the training data directly

Correct. Steering is used in mechanistic interpretability studies to prove that your model is correct. There are other, better ways to "decensor".

That seems like it will work for single events, but it would be very hard for complex topics that are closely intertwined with factual things you do want it to be able to answer...

Is Taiwan part of China? The CCP wants the answer to be yes.

What are the rules for traveling to Taiwan? What currency is used in Taiwan? Whose laws are enforced in Taiwan? Should I (a loyal Chinese citizen) support the Taiwanese military? Questions like these require the model to manage some cognitive dissonance.

Can you actually remove it now? They just use new training data to reinforce what they want and deprioritize ‘bad’ answers.

Fortunately we have lots of governmental and non-governmental organizations focused on removing "hate" online, so that our AI models will think correctly, without easy-to-identify censorship in the resulting model :)