by hiddencost
815 days ago
Keep in mind that LLMs are basically just sequence-to-sequence models that can scan 1 million tokens and do inference affordably. The underlying advances that made this possible (attention, transformers, masking, scale) carry over to other settings. We have a recipe for learning similar models on a huge variety of other tasks and data types.
The key insight (Jakob Uszkoreit's) behind using self-attention for language was that language is really more hierarchical than sequential, as indicated by the tree diagrams linguists use to describe sentence structure. The leaves of one branch of a tree (or sub-tree) are independent of those in another sub-tree, so they can be processed in parallel rather than in sequence. The idea of a multi-layer transformer is therefore to process this language hierarchy one level at a time, working from the leaves upward through the layers of the transformer (combining smaller neighborhoods into increasingly larger ones).
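To make the "parallel, not in sequence" point concrete, here is a rough sketch of self-attention in plain Python. It is a toy under stated assumptions, not how any real model is implemented: there are no learned query/key/value projections (the raw vectors are used directly), a single head, and no residual connections or feed-forward layers. The names `attend`, `self_attention`, and `stack` are made up for illustration.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(i, xs, causal=False):
    """Output for position i: a weighted average of the layer's input vectors.

    Assumption: queries, keys, and values are all just the raw inputs
    (a real transformer applies learned projections first).
    """
    d = len(xs[i])
    # With a causal mask, position i may only look at positions 0..i.
    last = i + 1 if causal else len(xs)
    scores = [
        sum(a * b for a, b in zip(xs[i], xs[j])) / math.sqrt(d)
        for j in range(last)
    ]
    weights = softmax(scores)
    return [sum(w * xs[j][k] for j, w in enumerate(weights)) for k in range(d)]


def self_attention(xs, causal=False):
    # Each position reads only the layer's *input*, never another position's
    # output, so these calls could run in any order or all at once.
    return [attend(i, xs, causal) for i in range(len(xs))]


def stack(xs, layers, causal=False):
    # Layers are sequential; positions within a layer are not.
    for _ in range(layers):
        xs = self_attention(xs, causal)
    return xs


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(self_attention(tokens))
    print(stack(tokens, layers=3, causal=True))
```

The only sequential dependency is between layers. Contrast with an RNN, where position i has to wait for position i-1 inside every layer.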