Hacker News
by nitros 108 days ago
How exactly does distilling a censored model produce an uncensored model?
3 comments

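It doesn't. Anthropic are, as usual, sounding an alarm to pull the ladder up behind them.
First of all, this is not technically distillation; it is closer to imitation learning. Distillation in the strict sense trains the student on the teacher's output distribution (logits), whereas an API only gives you sampled text to imitate.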
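To make the distinction concrete, here is a rough sketch of the two per-token objectives. The function names and the toy logit lists are made up for illustration, and real training code would work on batched tensors, not Python lists.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the full next-token distribution.

    Needs the teacher's logits, which a text-only API does not expose.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def imitation_loss(student_logits, sampled_token):
    """Cross-entropy on the single token the teacher actually emitted.

    Only needs the teacher's sampled text.
    """
    q = softmax(student_logits)
    return -math.log(q[sampled_token])
```

The first loss sees the teacher's whole distribution at each position; the second only sees which token came out, which is all you get from generated text.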

Second, you could do something like asking Claude to create 1 million (prompt, offensive response, non-offensive response) triplets, then train a model with DPO to prefer the offensive responses.
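For reference, a minimal sketch of the standard DPO loss on a single preference pair, assuming you already have the summed log-probabilities of each response under the model being trained and under a frozen reference model. The function name, argument names, and the `beta=0.1` default are illustrative, not taken from any particular library. The loss itself has no notion of content: which response counts as "chosen" is decided entirely by how the dataset is labeled.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) example.

    Each argument is the total log-probability of a response given the
    prompt, under either the policy being trained or the frozen reference.
    """
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(x)), written in a numerically stable form.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference the loss is log 2, and it falls as the policy shifts probability toward the chosen response relative to the rejected one.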

It technically can. There are patterns that emerge which manifest with no "safeguards" during training.