Hacker News new | ask | show | jobs
by hiddencost 815 days ago
Keep in mind that LLMs are basically just sequence to sequence models that can scan 1 million tokens and do inference affordably. The underlying advances (attention, transformers, masking, scale) that made this possible are fungible to other settings. We have a recipe for learning similar models on a huge variety of other tasks and data types.
1 comments

Transformers are really more general than seq-to-seq, maybe more like set-to-set or graph-to-graph.

The key insight (Jakob Uszkoreit) to using self-attention for language was that language is really more hierarchical than sequential, as indicated by linguist's tree diagrams for describing sentence structure. The leaves of one branch of a tree (or sub-tree) are independent of those in another sub-tree, allowing them to be processed in parallel (not in sequence). The idea of a multi-layer transformer is therefore to process this language hierarchy one level at a time, working from leaves on upwards through the layers of the transformer (processing smaller neighborhoods into increasingly larger neighborhoods).