Before going and digging into these, could you also explain what the necessary background is for this stuff to be meaningful?
In spite of having done a decent amount with neural networks, I'm a bit lost at how we suddenly got to what we're seeing now. It would be really helpful to understand the progression of things because I stepped away from this stuff for maybe 2 years and we seem to have crossed an ocean in the intervening time.
https://jalammar.github.io/illustrated-transformer/
https://nlp.seas.harvard.edu/2018/04/03/attention.html