|
|
|
|
|
by astrange
47 days ago
|
|
> This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding. I think an issue is that there is no permanent path to model understanding because of Goodhart's law. Models are motivated to appear aligned (well-trained) in any metric you use on them, which means that if you develop a new metric and train on it, it'll learn a way to cheat on it. |
|
The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.
Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.