


	by HarHarVeryFunny 815 days ago
	Transformers are really more general than seq-to-seq, maybe more like set-to-set or graph-to-graph. The key insight (Jakob Uszkoreit) behind using self-attention for language was that language is more hierarchical than sequential, as the tree diagrams linguists use to describe sentence structure suggest. The leaves of one branch of a tree (or sub-tree) are independent of those in another sub-tree, so they can be processed in parallel rather than in sequence. The idea of a multi-layer transformer is therefore to process this language hierarchy one level at a time, working from the leaves upward through the layers of the transformer (combining smaller neighborhoods into increasingly larger ones).
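
To make the "set-to-set" point concrete, here is a minimal sketch of a single self-attention head in plain Python. It assumes no positional encoding, no masking, and random weights. The names (`self_attention`, `Wq`, `Wk`, `Wv`, `perm`) are illustrative and not taken from any library. Under those assumptions, shuffling the input tokens just shuffles the outputs the same way, which is what you would expect from an operation on a set rather than a sequence. Every token's output is also computed independently of the loop order, which is why the work can run in parallel.

```python
import math
import random


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(tokens, Wq, Wk, Wv):
    # One attention head with no positional information (an assumption of this sketch).
    qs = [matvec(Wq, t) for t in tokens]
    ks = [matvec(Wk, t) for t in tokens]
    vs = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(qs[0]))
    out = []
    for q in qs:
        # Each query attends to every key; nothing here depends on token order.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in ks]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vs))
                    for j in range(len(vs[0]))])
    return out


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4  # hypothetical embedding size
Wq, Wk, Wv = (random_matrix(rng, d, d) for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

perm = [3, 0, 4, 1, 2]  # an arbitrary reordering of the five tokens
out = self_attention(tokens, Wq, Wk, Wv)
out_perm = self_attention([tokens[i] for i in perm], Wq, Wk, Wv)

# Permuting the inputs permutes the outputs in the same way (up to float rounding).
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(equivariant)
```

Real transformers add positional encodings to the token embeddings, and that addition is what reintroduces word order. The attention operation itself stays order-agnostic.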