

	by flaghacker 987 days ago
	It's important to note that these probe models are very simple: they're "trained" by just fitting a linear regression from the hidden activations to the desired output. This means the probes can barely do any computation themselves, so if they work at all, that's a strong indication that the signal they predict was already present in the hidden activations. For further evidence, see "Figure 5: Space and time neurons in Llama-2 models", which shows single neurons in LLMs that already encode this information, without even needing a probe model to extract it.
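To make the "just a linear regression" point concrete, here is a minimal sketch of such a probe on synthetic data. It is not the paper's code: the activations are random vectors, `latent` is a hypothetical stand-in for a quantity such as latitude, and the helper names `fit_linear_probe` and `probe_r2` are made up for illustration.

```python
import numpy as np

def fit_linear_probe(activations, targets):
    """Least-squares fit of targets ~ activations @ w + bias."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights

def probe_r2(weights, activations, targets):
    """Coefficient of determination of the probe on the given data."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    pred = X @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 2000, 64

# Assumption: the quantity of interest is written along one direction
# of the (synthetic) hidden state, plus a little noise.
latent = rng.normal(size=n)
direction = rng.normal(size=d)
activations = np.outer(latent, direction) + 0.1 * rng.normal(size=(n, d))

# Control: activations that carry no information about `latent`.
noise_only = rng.normal(size=(n, d))

train, test = slice(0, 1500), slice(1500, None)
w_signal = fit_linear_probe(activations[train], latent[train])
w_noise = fit_linear_probe(noise_only[train], latent[train])

r2_signal = probe_r2(w_signal, activations[test], latent[test])
r2_noise = probe_r2(w_noise, noise_only[test], latent[test])
print(f"held-out R^2, signal present: {r2_signal:.3f}")
print(f"held-out R^2, noise only:     {r2_noise:.3f}")
```

The probe only scores well on held-out data when the activations already contain the target in a linearly readable form; on the noise-only control it has nothing to work with.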

1 comment

xpe 986 days ago

Thanks. After some additional research, I tend to agree.

One thing of note I learned: while a _positive_ result (i.e. a probe showing a relationship) is proof that a label is encoded somehow, a _negative_ result does _not_ prove that the network under test does _not_ encode the information _somehow_. (The information could be encoded in more complex ways that the probe cannot detect.)
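A toy illustration of the negative-result caveat, under assumptions of my own (synthetic two-dimensional "activations", a target equal to the product of the two coordinates, and a hypothetical helper `held_out_r2`): the target is fully determined by the activations, yet a linear probe finds essentially nothing.

```python
import numpy as np

def held_out_r2(features, targets, n_train):
    """Fit a linear regression on the first n_train rows, score R^2 on the rest."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X[:n_train], targets[:n_train], rcond=None)
    pred = X[n_train:] @ w
    y = targets[n_train:]
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n, n_train = 4000, 3000

# Synthetic activations; the target is a nonlinear function of them.
activations = rng.normal(size=(n, 2))
target = activations[:, 0] * activations[:, 1]

# Linear probe on the raw activations: negative result.
r2_linear = held_out_r2(activations, target, n_train)

# Same regression after adding the product as an extra feature,
# i.e. a probe that is allowed one nonlinear step.
product = (activations[:, 0] * activations[:, 1])[:, None]
r2_product = held_out_r2(np.hstack([activations, product]), target, n_train)

print(f"linear probe R^2:         {r2_linear:.3f}")
print(f"probe with product R^2:   {r2_product:.3f}")
```

The information is present in both cases; only the second probe is expressive enough to read it out, which is why a failed linear probe says little about absence.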
