|
|
|
|
|
by mike_hearn
47 days ago
|
|
It's asking if you can auto encode activations. The AV decodes activations to text, and the AR re-encodes them back to activations. If the decoded text is completely wrong then it's unclear how the second model would re-encode them successfully given that they're both initialized from the same LM. |
|
As far as I can tell, the only reason that the explanations even resemble human speech is that AV and AR start off based on a trained language model. If we instead trained the same model architecture from scratch as AV and AR, they would eventually converge to some round trip format for activations, but it probably would be completely unintelligible and look only like human speech in so far as many of the tokenizer's tokens look like words or word fragments.
This whole process seems to rely on the fact that the text AR's output will still strongly favor output sentences that seem to make sense, rather than contradicting learned facts, etc. So it will favor mapping activations to plausible sounding text in ways where patterns can consistently hold across most of the training data. There absolutely is a risk that it will learn the wrong things for certain activation subpatterns like swapping concepts especially if none of the training data included a set of activation sub patterns that would help distinguish them the right way around.