Hacker News
by kk58 1122 days ago
In a CNN, the early layers seem to learn geometric primitives and the deeper layers seem to learn more complex geometric patterns, loosely speaking.

In a transformer, what do the query and key matrices learn? How do their weights manage to extract context no matter which word appears in which position?

1 comments

The transformer doesn't have the nice pyramid shape of CNNs, but it still needs multiple layers. There have been papers showing non-trivial interactions between successive layers, forming more complex circuits.

https://transformer-circuits.pub/2021/framework/index.html (warning, advanced difficulty)

The Q and K matrices learn how to relate tokens. Each of the heads will learn to extract a different relation. For example, one will link to the next token, another will link pronouns to their referents, another will match brackets, etc. Check out the cute diagrams here:

https://www.arxiv-vanity.com/papers/1904.02679/

So each head (Q and K pair) is like a program doing a specific pattern of lookup.
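To make the "lookup" idea concrete, here is a rough sketch of the attention weights for a single head in plain Python. The toy vectors and the `W_q` / `W_k` values are made up for illustration (in a real model they are learned), and I'm leaving out the value projection, masking and multi-head plumbing, so treat it as a sketch rather than how any particular library does it.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_weights(X, W_q, W_k):
    """X is a list of token vectors. Returns one row of weights per query token."""
    Q = [matvec(W_q, x) for x in X]  # what each token is looking for
    K = [matvec(W_k, x) for x in X]  # what each token offers as a match
    d = len(Q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    return [softmax(row) for row in scores]

# Hypothetical 2-d "embeddings" for three tokens, with made-up projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 0.0], [0.0, 2.0]]

A = attention_weights(X, W_q, W_k)
for row in A:
    print([round(w, 3) for w in row])
```

Each row of `A` says how strongly one token "looks up" every other token. Change `W_q` and `W_k` and you get a different lookup pattern, which is roughly what it means for each head to learn its own relation.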

Agreed. So attention is like a hierarchy of graphs where the nodes are tokens and the edges are attention scores, with one graph per head.

Now what's trippy is that each node carries position data. So a node's features and the position where it appears are used together to create an operator that projects a sequence into a semantic space.

This seems to work for any modality of data. So there is something about the order in which data appears that seems to be linked to semantics, and to me that hints at some deep causal structure being latent in LLMs.
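A quick sketch of that graph view, in case it helps. The attention matrices below are hand-written, hypothetical numbers (not taken from any real model), and the 0.3 threshold is an arbitrary choice just to keep the strongest edges.

```python
def attention_to_edges(weights, threshold=0.3):
    # Turn one head's attention matrix into a list of directed, weighted edges.
    # Edge (i, j, w) means token i attends to token j with weight w.
    edges = []
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            if w >= threshold:
                edges.append((i, j, w))
    return edges

tokens = ["the", "cat", "sat"]

# Hypothetical attention weights for two heads; each row sums to 1.
heads = {
    "head_0": [[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1]],   # mostly looks at the previous token
    "head_1": [[0.2, 0.2, 0.6],
               [0.1, 0.1, 0.8],
               [0.0, 0.5, 0.5]],   # mostly looks at the last token
}

# One graph per head, all sharing the same nodes (the tokens).
graphs = {name: attention_to_edges(w) for name, w in heads.items()}

for name, edges in graphs.items():
    for i, j, w in edges:
        print(name, tokens[i], "->", tokens[j], w)
```

Same nodes, different edge sets per head. Stack the layers and you get the hierarchy.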
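One way position can get into the node features, as a sketch: add a positional encoding to the token embedding before attention sees it. I'm assuming the sinusoidal scheme from the original Transformer paper here. Many models use learned or rotary position embeddings instead, and the embedding values below are made up.

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal encoding: sine on even dimensions, cosine on odd ones,
    # with wavelengths that grow geometrically across dimension pairs.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_position(embedding, pos):
    # The node feature the model sees is token embedding + position encoding.
    pe = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# Hypothetical 4-d embedding for one token, placed at two different positions.
tok = [0.5, -0.2, 0.1, 0.7]
at_0 = add_position(tok, 0)
at_3 = add_position(tok, 3)

print([round(v, 3) for v in at_0])
print([round(v, 3) for v in at_3])
```

The same token ends up as a different vector depending on where it sits, so the Q and K projections can pick up on content, position, or both.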