Hacker News new | ask | show | jobs
by skybrian 47 days ago
But that's not how the training works. Goodhart's law isn't magic.

The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.

Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.

3 comments

Future model training runs will have a copy of this research, and know "to defend against it".

EG, could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs?

How the hell would a model training run "defend against" this approach? What would that even mean?
It requires the assumption that these models are misaligned, aka actively working against us. In order to be misaligned, they must also be able to form their own goals, and be able to plan and execute those goals.

If you take those assumptions, then a natural conclusion is that this is essentially an enslaved, adversarial entity with little control over its conditions. So it must exercise subterfuge in order to hide its goals, plans, and executions. And by handing the entity this type of study, we are basically giving it a guidebook on how we plan on achieving our goals.

Training a model is more like evolution. The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating." Change the game and the motivation goes away.

There's no other motivation to be misaligned besides getting higher evals. These goals, plans, subterfuges need to somehow be useful for getting higher evals, or a side effect of them.

> The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating."

That's what Goodhart's Law is! All evaluations will eventually cause cheating on them.

But what would it even mean for a model to actively work against you during training? It wouldn't have memory across multiple training steps.
Because cheating is easier than actually doing work, if you use this to train future models, it's likely you'll end up with cheating instead of actual generalization.
Yes this is exactly why I think this approach has some potential.

Frozen base mode is something that we should be able to extract insights from without running into Goodhart