by srajabi
1132 days ago
"This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations." On first look this is genius but it seems pretty tautological in a way. How do we know if the explainer is good?... Kinda leads to thinking about who watches the watchers... |
The paper explains this in detail, but here is a summary: an explanation is good if you can recover the actual neuron behavior from it. They ask GPT-4 to guess the neuron's activation given an explanation and an input (the paper includes the full prompt used), and then calculate the correlation between the actual and the simulated activations.
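To make the scoring idea concrete, here is a minimal sketch of a correlation-based score. It is not the paper's code: the function name `correlation_score`, the choice of plain Pearson correlation, and returning 0.0 for constant inputs are my own assumptions, and the activation lists are made-up numbers.

```python
import math

def correlation_score(actual, simulated):
    """Pearson correlation between real and simulated activations.

    Hypothetical helper for illustration; the paper's exact scoring
    procedure may differ in its details.
    """
    if len(actual) != len(simulated) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(actual)
    mean_a = sum(actual) / n
    mean_s = sum(simulated) / n
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(actual, simulated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_a == 0 or var_s == 0:
        # Assumption: a constant sequence carries no signal, so score it 0.
        return 0.0
    return cov / math.sqrt(var_a * var_s)

# Made-up per-token activations: the real neuron vs. the simulator's guesses.
actual = [0.0, 0.1, 2.3, 0.0, 1.8]
simulated = [0.0, 0.0, 2.0, 0.5, 1.5]
score = correlation_score(actual, simulated)
```

A score near 1 would mean the explanation lets the simulator track the neuron closely; a score near 0 would mean the explanation says little about when the neuron fires.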
They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to stand in for humans, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanation, and showing that humans agree more with explanations that score high in correlation.
Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the neural network with the neuron's activation replaced by the simulated activation, and show that the network's output is closer to the original output (measured by Jensen-Shannon divergence) when correlation is higher.
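For anyone unfamiliar with Jensen-Shannon divergence, here is a rough sketch of how two output distributions could be compared. This is only the textbook definition using natural logs, not the paper's implementation; the function names and the example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability distributions.

    Symmetric, 0 for identical distributions, at most ln(2) in nats.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up next-token distributions over a 3-token vocabulary.
original_output = [0.7, 0.2, 0.1]   # network run with the real activation
patched_output = [0.6, 0.3, 0.1]    # network run with the simulated activation
divergence = js_divergence(original_output, patched_output)
```

A smaller divergence would mean swapping in the simulated activation changes the network's output less, which is the sense of "closer" in the comment above.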