


	by srajabi 1132 days ago
	"This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations." On first look this is genius but it seems pretty tautological in a way. How do we know if the explainer is good?... Kinda leads to thinking about who watches the watchers...

10 comments

sanxiyn 1132 days ago

> How do we know if the explainer is good?

The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation. They ask GPT-4 to guess neuron activation given an explanation and an input (the paper includes the full prompt used). And then they calculate correlation of actual neuron activation and simulated neuron activation.
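The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function name `correlation_score` and the toy activation values are assumptions invented for this example, and the simulated activations would in practice come from a model reading the explanation.

```python
import math


def correlation_score(actual, simulated):
    """Pearson correlation between actual and simulated activations."""
    if len(actual) != len(simulated) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(actual) / len(actual)
    mean_s = sum(simulated) / len(simulated)
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(actual, simulated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_a == 0 or var_s == 0:
        # Assumption: a constant sequence carries no signal, so score it as 0.
        return 0.0
    return cov / math.sqrt(var_a * var_s)


# Hypothetical per-token activations, invented for illustration only.
actual = [0.0, 0.1, 2.3, 0.0, 1.9]
good_simulation = [0.0, 0.0, 2.0, 0.0, 2.0]  # tracks the real neuron
bad_simulation = [2.0, 2.0, 0.0, 2.0, 0.0]   # fires on the wrong tokens

print(correlation_score(actual, good_simulation))
print(correlation_score(actual, bad_simulation))
```

An explanation whose simulation tracks the real neuron scores close to 1, while one that fires on the wrong tokens scores low or negative.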

They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to simulate humans, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanation, and showing that humans agree more with an explanation that scores high in correlation.

Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the neural network with activation of the neuron replaced with simulated activation, and show that the neural network output is closer (measured in Jensen-Shannon divergence) if correlation is higher.
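For readers unfamiliar with the metric, here is a minimal sketch of Jensen-Shannon divergence between two discrete output distributions. The function names and the toy distributions are assumptions made up for illustration; they are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions over a 3-token vocabulary.
original = [0.7, 0.2, 0.1]           # network run with the real activation
patched_faithful = [0.65, 0.25, 0.1]  # activation replaced by a good simulation
patched_poor = [0.2, 0.3, 0.5]        # activation replaced by a poor simulation

print(js_divergence(original, patched_faithful))
print(js_divergence(original, patched_poor))
```

A smaller value means that replacing the neuron's activation with the simulated one changed the network's output less.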

habryka 1132 days ago

> The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation.

To be clear, this is only neuron activation strength for text inputs. We aren't doing any mechanistic modeling of whether our explanation of what the neuron does predicts any role the neuron might play within the internals of the network, despite most neurons likely having a role that can only be succinctly summarized in relation to the rest of the network.

It seems very easy to end up with explanations that correlate well with a neuron, but do not actually meaningfully explain what the neuron is doing.

sanxiyn 1131 days ago

Eh, that's why the second check I mentioned is there... To see what the neuron is doing in relation to the rest of the network.

TheRealPomax 1132 days ago

Why is this genius? It's just the NN equivalent of making a new programming language and getting it to the point where its compiler can be written in itself.

The reliability question is of course the main issue. If you don't know how the system works, you can't assign a trust value to anything it comes up with, even if it seems like what it comes up with makes sense.

0xParlay 1132 days ago

I love the epistemology-related discussions AI inevitably surfaces. How can we know anything that isn't empirically evident, and all that.

It seems NN output could be trusted in scenarios where a test exists. For example: "ChatGPT design a house using [APP] and make sure the compiled plans comply with structural/electrical/design/etc codes for area [X]".

But how is any information that isn't testable trusted? I'm open to the idea ChatGPT is as credible as experts in the dismal sciences given that information cannot be proven or falsified and legitimacy is assigned by stringing together words that "makes sense".

DaiPlusPlus 1132 days ago

> But how is any information that isn't testable trusted? I'm open to the idea ChatGPT is as credible as experts in the dismal sciences given that information cannot be proven or falsified and legitimacy is assigned by stringing together words that "makes sense".

I understand that around the 1980s-ish, the dream was that people could express knowledge in something like Prolog, including the test-case, which can then be deterministically evaluated. This does really work, but surprisingly many things cannot be represented in terms of “facts” which really limits its applicability.

I didn’t opt for Prolog electives in school (I did Haskell instead) so I honestly don’t know why so many “things” are unrepresentable as “facts”.

philomath_mn 1132 days ago

I bet GPT is really good at Prolog; that would be interesting to explore.

"Answer this question in the form of a testable prolog program"

falsissime 1130 days ago

You lost this bet: Write append3/4 which appends three lists to a fourth, such that append3(Xs,Ys,[e],[]) terminates.

DaiPlusPlus 1132 days ago

Did you give it a try?

typon 1132 days ago

Seems relevant: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...

jacobr1 1132 days ago

There is a longer-term problem of trusting the explainer system, but in the near term that isn't really a concern.

The bigger value here in the near term is _explicability_ rather than alignment per se. Having good explicability might provide insights into the design and architecture of LLMs in general, and that in turn may enable better design of alignment schemes.

lynx23 1132 days ago

I can almost hear the Animatrix voiceover: "At first, AI was useful. Then, we decided to automate oversight... The rest is history."

wongarsu 1132 days ago

It also lags one iteration behind, which is a problem because a misaligned model might lie to you, spoiling all future research with this method.

regularfry 1132 days ago

It doesn't have to lag, though. You could ask gpt-2 to explain gpt-2. The weights are just input data. The reason this wasn't done on gpt-3 or gpt-4 is just because a) they're much bigger, and b) they're deeper, so the roles of individual neurons are more attenuated.

KevinBenSmith 1132 days ago

I had similar thoughts about the general concept of using AI to automate AI Safety.

I really like their approach and I think it’s valuable. And in this particular case, they do have a way to score the explainer model. And I think it could be very valuable for various AI Safety issues.

However, I don’t yet see how it can help with the potentially biggest danger, where a superintelligent AGI is created that is not aligned with humans. The newly created AGI might be 10x more intelligent than the explainer model, to such an extent that the explainer model is not capable of understanding any tactics deployed by the superintelligent AGI. In the same way, ants are most probably not capable of explaining the tactics deployed by humans, even if we gave them 100 years to figure it out.

ChatGTP 1132 days ago

Safest thing to do: stop inventing and building more powerful and potentially dangerous systems which we can’t understand?

m1el 1132 days ago

You're correct to have a suspicion here. Hypothetically, the explainer could omit a neuron or give a wrong explanation for the role of a neuron. Imagine you're trying to understand a neural network, and you spend an enormous amount of time generating hypotheses and validating them. Well, the explainer might give you 90% correct hypotheses, which means you have 10 times less work to produce hypotheses. So if you have a solid way of testing an explanation, even if the explainer is evil, it's still useful.

vhold 1132 days ago

It produces examples that can be evaluated.

https://openaipublic.blob.core.windows.net/neuron-explainer/...

bottlepalm 1132 days ago

Using 'I'm feeling lucky' from the neuron viewer is a really cool way to explore different neurons. So is being able to navigate up and down through the net to related neurons.

eternalban 1132 days ago

Fun to look at activations and then search for the source on the net.

"Suddenly, DM-sliding seems positively whimsical"

https://openaipublic.blob.core.windows.net/neuron-explainer/...

https://www.thecut.com/2016/01/19th-century-men-were-awful-a...

quickthrower2 1132 days ago

How do we know WE are good explainers :-)

liamconnell 1131 days ago

Gpt2 answers to gpt3. Gpt3 answers to gpt4. Gpt4 answers to God.