by sebzim4500, 1191 days ago
It doesn't really use them; it uses something called RoPE (rotary position embeddings), which is hardcoded rather than learned and is applied multiplicatively at every layer to both the query and the key (not the value).
https://arxiv.org/abs/2104.09864
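
For anyone who wants to see what "hardcoded and multiplicative" means in practice, here is a rough NumPy sketch of the rotation, not any particular model's implementation. The function name `rope` and the `base` parameter are my own; the default of 10000 and the interleaved pairing of features follow the paper, but real implementations often lay out the pairs differently.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x, shape (seq_len, dim), by
    angles that depend only on the token position (nothing is learned)."""
    seq_len, dim = x.shape
    assert dim % 2 == 0, "dim must be even so features can be paired"
    # One fixed frequency per feature pair: base^(-2i/dim).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)            # (dim/2,)
    # Angle for position m and pair i is m * inv_freq[i].
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    # Standard 2D rotation applied to each (even, odd) pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Inside attention, the rotation goes on queries and keys only:
#   scores = rope(q) @ rope(k).T
# The values are left unrotated.
```

Because both sides are rotated, the dot product between a query at position m and a key at position n depends only on the offset between the two positions, which is how the scheme encodes relative position without any learned table.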