Hacker News
by gradys 932 days ago
Attention itself was the key idea of that paper and, as you sort of acknowledge, was definitely not just throwing things at the wall. It was the culmination of a long line of work gradually progressing toward fully dynamic routing via attention, and it was motivated, if not by deep theory, then at least by deep intuition from linguistics. The other details of transformers are perhaps sort of arbitrary, but they made sense to everyone at the time. There was no claim that those other details were optimal, just that they were one way of surrounding the attention mechanism with computing machinery that worked.