Hacker News
by goldemerald 390 days ago
This is an interesting line of research, but it is missing a key aspect: there are (almost) no references to the linear representation hypothesis. Much recent work on neural network interpretability has shown that individual neurons are polysemantic and therefore practically useless for explainability. My hypothesis is that fitting linear probes (or a sparse autoencoder) would reveal linearly encoded semantic attributes.

It is unfortunate, because they briefly mention Neel Nanda's Othello experiments but not the wide array of related experiments, like the NeurIPS Oral "Linear Representation Hypothesis in Language Models" or even Golden Gate Claude.
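
To make the linear-probe idea concrete, here is a minimal sketch on synthetic data, not anything taken from the paper. It assumes a made-up activation matrix `acts` in which a concept is planted along a random unit direction `true_dir`, so the concept is spread across many neurons rather than stored in one. The names `acts`, `true_dir`, and `probe` are all hypothetical, and the probe is a plain least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 2000, 16

# Hypothetical concept direction: a random unit vector, so no single
# neuron (coordinate axis) carries the concept on its own.
true_dir = rng.standard_normal(dim)
true_dir /= np.linalg.norm(true_dir)

# Synthetic "activations" and a concept score that depends linearly on them.
acts = rng.standard_normal((n_samples, dim))
concept = acts @ true_dir + 0.1 * rng.standard_normal(n_samples)

# Linear probe: least-squares fit of the concept score from the activations.
probe, *_ = np.linalg.lstsq(acts, concept, rcond=None)

# Cosine similarity between the recovered probe and the planted direction.
cosine = probe @ true_dir / np.linalg.norm(probe)

# Best any single neuron can do, for comparison with the probe.
best_neuron_corr = max(
    abs(np.corrcoef(acts[:, i], concept)[0, 1]) for i in range(dim)
)
probe_corr = np.corrcoef(acts @ probe, concept)[0, 1]
```

In this toy setup the probe recovers the planted direction almost exactly, while the best individual neuron correlates with the concept only partially. Real activations are of course not guaranteed to behave this cleanly.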

2 comments

We mention this issue exactly in the fourth paragraph in Section 4 and in Appendix F!
That addresses the incomprehensibility of PCA and of applying a transformation to the entire latent space. I've never found PCA to be meaningful for deep learning. As far as I can tell, the polysemanticity of neurons cannot be addressed with a single linear transformation. There is no sparse analysis (via linear probes or SAEs), and hence the issue remains unaddressed.
Does what you're saying imply that there is a rotation matrix you can apply to each activation output to make it less entangled?
Not quite. For an underlying semantic concept (e.g., smiling face), you can go from a basis vector [0,1,0,...,0] to the original latent space via a single rotation. You could then induce said concept by moving the original latent point along that linear direction.
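
A rough sketch of what I mean, using a random rotation as a stand-in (in practice the rotation would have to be found, e.g. with probes or an SAE; `random_rotation`, `R`, `z`, and `alpha` are all hypothetical names for illustration):

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Return a random dim x dim rotation matrix (orthogonal, det = +1)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    q = q * np.sign(np.diag(r))   # remove the sign ambiguity of QR
    if np.linalg.det(q) < 0:      # force a proper rotation
        q[:, 0] = -q[:, 0]
    return q

dim = 8
R = random_rotation(dim)

# Concept as a basis vector [0, 1, 0, ..., 0] in the "semantic" basis.
e_concept = np.zeros(dim)
e_concept[1] = 1.0

# One rotation maps it to a direction in the original latent space.
direction = R @ e_concept

# Induce the concept by moving a latent point along that direction.
z = np.random.default_rng(1).standard_normal(dim)
alpha = 3.0                       # steering strength, chosen arbitrarily
z_steered = z + alpha * direction

# Read the point back in the semantic basis: only coordinate 1 changes.
before = R.T @ z
after = R.T @ z_steered
```

Reading `z_steered` back through `R.T` shifts only the concept coordinate by `alpha`, even though every entry of the latent vector itself changes.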
I think we are saying the same thing. Please correct me where I am wrong, though. You could look at the maps in some way, but instead of the basis being the one-hot dimensions (the standard basis), it could be rotated.