by sebzim4500 1191 days ago
It doesn't really use them. It uses something called RoPE, which is hardcoded rather than learned and is applied multiplicatively (as a position-dependent rotation) at every attention layer to both the query and the key. The value is left untouched.

https://arxiv.org/abs/2104.09864
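
For anyone who wants to see what "applied multiplicatively" means in practice, here is a rough pure-Python sketch of the rotation as I understand it from the paper. The names `rope` and `dot` are my own, the base of 10000 is the paper's default, and I pair adjacent dimensions the way the paper writes it (some real implementations pair dimensions differently), so treat this as an illustration rather than any particular model's code.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos.

    Nothing here is learned: the angles depend only on the position,
    the dimension index, and the fixed base.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Pair i/2 rotates at frequency base^(-i/d); later pairs rotate more slowly.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]   # toy query vector
k = [0.5, -1.0, 2.0, 0.0]  # toy key vector

# Only the query and key are rotated; values would be used as-is.
score_a = dot(rope(q, 5), rope(k, 3))    # positions 5 and 3, offset 2
score_b = dot(rope(q, 12), rope(k, 10))  # positions 12 and 10, offset 2
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n, which is the point of the scheme: the attention score sees relative position without any learned position table.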