| In this case, I think we do, if you check out the paper (https://openaipublic.blob.core.windows.net/neuron-explainer/...). Their method is: 1. Show GPT-4 a GPT-produced text in which the activation level of a specific neuron is highlighted at each point where the text was being produced, then ask GPT-4 for an explanation of what the neuron is doing. Text: "...mathematics is done _properly_, it...if it's done _right_. (Take ..." GPT-4 produces "words and phrases related to performing actions correctly or properly". 2. Based on that explanation, get GPT-4 to guess how strongly the neuron activates on a new text. "Assuming that the neuron activates on words and phrases related to performing actions correctly or properly. GPT-4 guesses how strongly the neuron responds at each token: '...Boot. When done _correctly_, "Secure...'" 3. Compare those predictions to the actual activations of the neuron on that text to generate a score. So there is no introspection going on: the explanation comes from GPT-4 reading activations from the outside, and it is checked against measured activations. They say, "We applied our method to all MLP neurons in GPT-2 XL [out of 1.5B?]. We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior." But they also mention, "However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic." |
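To make step 3 concrete, here is a minimal sketch of what "compare predictions to actual activations" could look like. It assumes the score is a Pearson correlation between the simulated and the real per-token activations; the quoted passages above do not say which metric is used, so treat that as my assumption. The function names (`pearson`, `score_explanation`) and all the numbers are made up for illustration and are not taken from the paper or its code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant sequence carries no information about the other one.
        return 0.0
    return cov / sqrt(var_x * var_y)


def score_explanation(simulated, actual):
    """Hypothetical scorer: how well do the guessed activations track the real ones?"""
    return pearson(simulated, actual)


# Made-up data: one guess and one measured activation per token.
tokens = ["Boot", ".", "When", "done", "correctly", ","]
simulated = [0, 0, 1, 4, 9, 0]            # guesses derived from the explanation
actual = [0.1, 0.0, 0.3, 2.0, 5.5, 0.2]   # measured neuron activations

score = score_explanation(simulated, actual)
print(f"explanation score: {score:.2f}")
```

Under this reading, a high score only says the explanation predicts where the neuron fires on the sampled text. It says nothing about the model reporting on its own internals, which is the point of "no introspection" above.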