| HN Mirror



	by delichon 32 days ago
	Steering seems like a circumventable kludge compared to adjusting the training data directly. That is, use AI to remove the problematic content and replace it with the party line. I imagine that this is at least in progress.

4 comments

s314 32 days ago

> Steering seems like a circumventable kludge compared to adjusting the training data directly

Correct. Steering is used in mechanistic interpretability studies to prove that your model is correct. There are other, better ways to "decensor".

gpm 32 days ago

That seems like it would work for single events, but it would be very hard for complex topics that are closely intertwined with factual things you do want the model to be able to answer...

Is Taiwan part of China? The CCP wants the answer to be yes.

What are the rules for traveling to Taiwan? What currency is used in Taiwan? Whose laws are enforced in Taiwan? Should I (a loyal Chinese citizen) support the Taiwanese military? And so on. Answering these requires the model to manage some cognitive dissonance.

stogot 32 days ago

Can you actually remove it now? They just use new training data to reinforce what they want and deprioritize ‘bad’ answers.

like_any_other 32 days ago

Fortunately, we have lots of governmental and non-governmental organizations focused on removing "hate" online, so that our AI models will think correctly, without easy-to-identify censorship parts in the resulting model :)
