by HarHarVeryFunny
1113 days ago
Another piece of the puzzle seems to be transformer "induction heads", where attention heads in consecutive layers work together to provide a mechanism believed to be responsible for much of in-context learning. The idea is that earlier instances of a token pattern/sequence in the context are used to predict the continuation of a similar pattern later on. In the simplest case this is a copying operation: an early occurrence of AB predicts that a later A should be followed by B. In the more general case this becomes A'B' => AB, where A' and B' are only similar to A and B rather than identical, which seems to be more of an analogy-type relationship. https://arxiv.org/abs/2209.11895 https://youtu.be/Vea4cfn6TOA This is still only a low-level mechanistic operation, but it is at least a glimpse into how transformers operate at inference time.
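
To make the simple copying case concrete, here is a toy sketch in plain Python. It only mimics the input/output behaviour (an earlier AB makes a later A predict B); it is not how a transformer implements it, and the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Toy version of the AB ... A -> B copying rule.

    Looks for the most recent earlier occurrence of the last token and
    returns whatever token followed it. Returns None if there is no
    earlier occurrence. Hypothetical helper, not from the paper.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy the token that followed the earlier occurrence.
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    # Earlier "A B" in the context, then "A" again: the rule predicts "B".
    print(induction_predict(["A", "B", "C", "D", "A"]))
```

The general A'B' => AB case would need a similarity match in place of the exact `==` comparison, which this sketch does not attempt.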