by flaghacker
987 days ago
It's important to note that these probe models are very simple - they're "trained" by just fitting a linear regression from the hidden activations to the desired output. This means the probes can barely do any computation themselves, so if they work at all, that is a strong indication that the signal they predict was already present in the hidden activations. For even stronger evidence, see "Figure 5: Space and time neurons in Llama-2 models", which shows single neurons in LLMs that already encode this information, without even needing a probe model to extract it.
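To make "just a linear regression" concrete, here is a minimal sketch of a linear probe in NumPy. It is not the paper's code: the activations are synthetic random vectors standing in for real hidden states, and the names `fit_linear_probe`, `r2_score`, `encoded` and `unrelated` are made up for illustration. The probe is fit by ordinary least squares and scored on held-out rows.

```python
import numpy as np

def fit_linear_probe(acts, targets):
    """Least-squares fit of targets ~ acts @ w + b; returns (w, b)."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return coef[:-1], coef[-1]

def r2_score(acts, targets, w, b):
    """Coefficient of determination of the probe's predictions."""
    pred = acts @ w + b
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 2000, 64
acts = rng.normal(size=(n, d))          # synthetic stand-in for hidden activations
direction = rng.normal(size=d)          # hypothetical direction carrying the signal
encoded = acts @ direction + 0.1 * rng.normal(size=n)  # linearly encoded target
unrelated = rng.normal(size=n)          # target independent of the activations

train, test = slice(0, 1500), slice(1500, None)

w, b = fit_linear_probe(acts[train], encoded[train])
r2_encoded = r2_score(acts[test], encoded[test], w, b)

w, b = fit_linear_probe(acts[train], unrelated[train])
r2_unrelated = r2_score(acts[test], unrelated[test], w, b)

print(f"held-out R^2, linearly encoded target: {r2_encoded:.3f}")
print(f"held-out R^2, unrelated target:        {r2_unrelated:.3f}")
```

The probe only scores well on held-out data when the target is already a linear function of the activations; for the unrelated target it has nothing to latch onto, which is the sense in which the probe cannot do the computation itself.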
One thing of note I learned is that while a _positive_ result (i.e. a probe showing a relationship) is proof that a label is encoded somehow, a _negative_ result does _not_ prove that the network under test does _not_ encode the information _somehow_. (The information could be encoded in more complex ways that the probe does not discover.)
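A toy illustration of the negative-result caveat, under made-up assumptions: two hypothetical activation dimensions take values in {-1, +1}, and the label is their product (an XOR-style encoding). The label is fully determined by the activations, yet a linear probe on the raw activations finds nothing, while the same probe on a nonlinear feature recovers it exactly. The helper `heldout_r2` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
acts = rng.choice([-1.0, 1.0], size=(n, 2))  # two hypothetical activation dims
label = acts[:, 0] * acts[:, 1]              # XOR-style label, fully determined by acts

def heldout_r2(features, targets, n_train=3000):
    """Fit least squares on the first n_train rows, return R^2 on the rest."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    coef, *_ = np.linalg.lstsq(X[:n_train], targets[:n_train], rcond=None)
    resid = targets[n_train:] - X[n_train:] @ coef
    total = targets[n_train:] - targets[n_train:].mean()
    return 1.0 - (resid @ resid) / (total @ total)

# Linear probe on the raw activations: the label is uncorrelated with each dim.
linear_r2 = heldout_r2(acts, label)

# Same probe on a nonlinear feature (the product of the two dims).
product = (acts[:, 0] * acts[:, 1])[:, None]
nonlinear_r2 = heldout_r2(product, label)

print(f"linear probe R^2:           {linear_r2:.3f}")
print(f"probe on product feature:   {nonlinear_r2:.3f}")
```

So a failed linear probe here would say something about the probe's limits, not about whether the information is present.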