|
|
|
|
|
by mbowcut2
327 days ago
|
|
The problem with embeddings is that they're basically inscrutable to anything but the model itself. It's true that they must encode the semantic meaning of the input sequence, but the learning process compresses it to the point that only the model's learned decoder head knows what to do with it. Anthropic's developed interpretable internal features for Sonnet 3 [1], but from what I understand that requires somewhat expensive parallel training of a network whose sole purpose is attempt to disentangle LLM hidden layer activations. [1] https://transformer-circuits.pub/2024/scaling-monosemanticit... |
|
I've come up with so many failed analogies at this point, I lost count (the concept of fast and slow clocks to represent the positional index / angular rotation has been the closest I've come so far).