| HN Mirror


	by nitros 155 days ago
	How exactly does distilling a censored model produce an uncensored model?

3 comments

nebezb 155 days ago

It doesn't. Anthropic are, as usual, sounding an alarm to pull the ladder up from behind them.


janalsncm 155 days ago

First of all, this is not technically distillation; it is closer to imitation learning.

Second, you could do something like asking Claude to create 1 million (prompt, offensive response, non-offensive response) triplets, then train a model with DPO to prefer the offensive responses.
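For anyone unfamiliar with DPO, here is a rough sketch of the per-example loss it minimizes. This is the standard published objective written in plain Python, not code from any particular library. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are my own choices for illustration. The inputs are assumed to be summed token log-probabilities of each response under the model being trained (the "policy") and under a frozen reference copy.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the log-probability of a full response given the prompt.
    "chosen" is the response the training data marks as preferred,
    "rejected" is the other one. beta (hypothetical default) scales how
    strongly the policy is pushed away from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss is log(2). It falls as the policy shifts probability toward the preferred response relative to the reference, and rises if it shifts the other way. Which response counts as "preferred" is purely a labeling decision in the dataset.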


ncb9094 155 days ago

It technically can. There are patterns that emerge which manifest with no "safeguards" during training.
