by rooftopzen 354 days ago
Important topic, but this is expected behavior (and the research is questionable if it implies this happened randomly):

1) Weights change during fine-tuning, so previously applied safety constraints become weaker.

2) Asking a model "what it would do" with minorities is effectively querying the training data (e.g. Reddit and others), which contains hate speech; this is expected behavior (especially if the prompt contains language that elicits the pattern).

1 comment

Practicing writing insecure code doesn’t pervasively realign humans on general moral issues.

In fact, human hypocrisy is, if anything, an interesting example of how humans can learn to be immoral in a narrow context, given a reason, without it affecting their general moral understanding. (Which, of course, illustrates another kind of alignment hazard.)

But apparently it does for large models.

Whether this is surprising or not, it is certainly worth understanding.

One obvious difference between models and humans is that models learn many things at the same time, i.e. in a single period of training across all their training data.

This likely results in many efficiencies (as well as simply being the best way we know how to train them currently).

One efficiency is that, because it learns about very different topics at the same time, the model can converge on representations for very different things that share common patterns, both obvious and subtle.

But a vulnerability of this is that retraining to alter any one topic is much more likely to alter patterns across wide swaths of encoded knowledge, given that they are all riddled with shared encodings, obvious and not.
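To make that concrete, here is a toy sketch of the mechanism, not a claim about how any real model behaves. It assumes a tiny linear model in which two tasks read from one shared encoder; every name in it (`W0`, `head_a`, `head_b`, `finetune_task_a`) is hypothetical and chosen just for illustration. Fine-tuning only ever sees task A, yet the outputs for task B shift because both depend on the same shared weights.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def shared_features(W, x):
    # Shared encoder: every task reads the same hidden features.
    return [dot(row, x) for row in W]

def predict(W, head, x):
    return dot(head, shared_features(W, x))

def finetune_task_a(W, head_a, data, lr=0.01, steps=200):
    # Gradient descent on squared error for task A only.
    # Only the SHARED encoder is updated; head_a is held fixed.
    W = [row[:] for row in W]  # work on a copy, leave the original intact
    for _ in range(steps):
        for x, y in data:
            err = predict(W, head_a, x) - y
            for i, row in enumerate(W):
                for j in range(len(row)):
                    row[j] -= lr * err * head_a[i] * x[j]
    return W

rng = random.Random(0)
dim_in, dim_hidden = 4, 3
W0 = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_hidden)]
head_a = [rng.uniform(-1, 1) for _ in range(dim_hidden)]
head_b = [rng.uniform(-1, 1) for _ in range(dim_hidden)]
probe_inputs = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(5)]

# Narrow fine-tuning set: shifted targets for task A on two inputs only.
task_a_data = [(x, predict(W0, head_a, x) + 1.0) for x in probe_inputs[:2]]
W1 = finetune_task_a(W0, head_a, task_a_data)

# Task B never appeared in fine-tuning, yet its outputs move.
drift_b = sum(
    abs(predict(W1, head_b, x) - predict(W0, head_b, x)) for x in probe_inputs
) / len(probe_inputs)
print(f"mean absolute drift on untouched task B: {drift_b:.4f}")
```

In this sketch the drift on task B scales with how much `head_a` and `head_b` overlap, which is a crude stand-in for the "shared encodings" above: the more two behaviors lean on the same features, the more a narrow update to one leaks into the other.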

Humans, by contrast, apparently re-learn and re-encode many examples of similar patterns incrementally, across many domains. We do get efficiencies from similar relationships across diverse domains, but having greater redundancy lets us learn changed behavior in specific contexts without eviscerating our behavior across a wide scope of other contexts.