by kk58
1122 days ago
In a CNN, the early layers seem to learn geometric primitives and deeper layers seem to learn more complex geometric patterns, loosely speaking. In a transformer, what do the query and key matrices learn? How do their weights manage to extract context no matter which word appears in which position?
https://transformer-circuits.pub/2021/framework/index.html (warning, advanced difficulty)
The Q and K matrices learn how to relate tokens. Each head learns to extract a different relation. For example, one will link each token to the next token, another will link pronouns to their referents, another will match brackets, etc. Check out the cute diagrams here:
https://www.arxiv-vanity.com/papers/1904.02679/
So each head (Q and K pair) is like a program doing a specific pattern of lookup.
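To make the "lookup program" idea concrete, here is a minimal sketch of a single head whose attention pattern always points at the next token, whatever the words are. Everything in it is an assumption for illustration: the weights W_Q and W_K are set by hand rather than learned, the embedding is a plain concatenation of a token one-hot and a position one-hot, there is no causal mask, and names like embed and attention_pattern are made up. A trained model would arrive at something much messier, but the mechanism (score = query · key, then softmax) is the standard one.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product: a is (n x m), b is (m x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pattern(x, w_q, w_k):
    # q_i = x_i W_Q, k_j = x_j W_K, score_ij = q_i . k_j / sqrt(d)
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]

# Hypothetical toy setup: 4-word vocabulary, sequences of length 4.
VOCAB = ["the", "cat", "sat", "down"]
N_POS = 4
D_MODEL = len(VOCAB) + N_POS

def embed(tokens):
    # Each row is [token one-hot | position one-hot].
    rows = []
    for pos, tok in enumerate(tokens):
        tok_part = [1.0 if tok == v else 0.0 for v in VOCAB]
        pos_part = [1.0 if pos == p else 0.0 for p in range(N_POS)]
        rows.append(tok_part + pos_part)
    return rows

# Hand-set weights (NOT learned). Both matrices have zeros in the token rows,
# so this head ignores which word is present and reads only the position.
SCALE = 20.0
W_Q = [[0.0] * N_POS for _ in range(D_MODEL)]
W_K = [[0.0] * N_POS for _ in range(D_MODEL)]
for p in range(N_POS):
    W_K[len(VOCAB) + p][p] = 1.0            # key at position p says "I am p"
    if p + 1 < N_POS:
        W_Q[len(VOCAB) + p][p + 1] = SCALE  # query at p asks "who is p + 1?"

pattern_a = attention_pattern(embed(["the", "cat", "sat", "down"]), W_Q, W_K)
pattern_b = attention_pattern(embed(["down", "sat", "the", "cat"]), W_Q, W_K)

for row in pattern_a:
    print([round(w, 3) for w in row])
```

Shuffling the words leaves the pattern unchanged, because this particular head only looks at position. A head that links pronouns to their referents would do the opposite and put its nonzero weights mostly in the token rows of W_Q and W_K.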