by amrb 1198 days ago
I'd like to know how it can support a 32k context when all the other models I've seen are 2-4k. Does this mean it has a bigger attention layer, or that it is 4x larger in billions of parameters?
1 comment

It could be done in a dozen ways. One beautiful method is to use the xPos positional embedding pioneered by Microsoft and scale the context window size at runtime (even better if your attention is subquadratic - again, there are a dozen varieties to pick from). See:

"A Length-Extrapolatable Transformer"

https://arxiv.org/abs/2212.10554

"Language Is Not All You Need: Aligning Perception with Language Models"

https://arxiv.org/abs/2302.14045

Notably, this positional embedding has been implemented by lucidrains in his x-transformers package: https://github.com/lucidrains/x-transformers/blob/main/x_tra...
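
To make the idea concrete, here is a minimal pure-Python sketch of an xPos-style embedding. It is not the x-transformers code, just an illustration under stated assumptions: the function names `xpos_apply` and `xpos_score` are made up for this example, and the `gamma`, `base` and `scale_base` defaults are illustrative values, not settings confirmed by the thread. Each pair of dimensions is rotated by an angle proportional to the position (as in rotary embeddings), queries are multiplied by a per-pair factor raised to the position, and keys by its inverse.

```python
import math


def xpos_apply(vec, pos, is_key=False, gamma=0.4, base=10000.0, scale_base=512.0):
    """Rotate and scale one query/key vector for position `pos` (xPos-style sketch).

    `gamma`, `base` and `scale_base` are illustrative assumptions.
    """
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = base ** (-i / half)               # rotation frequency of pair i
        zeta = (i / half + gamma) / (1 + gamma)   # per-pair scale factor, below 1
        exponent = pos / scale_base
        # Queries are scaled by zeta^pos, keys by zeta^(-pos), so the product
        # of the two only sees the difference between the positions.
        scale = zeta ** (-exponent if is_key else exponent)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(scale * (x * c - y * s))
        out.append(scale * (x * s + y * c))
    return out


def xpos_score(q, k, n, m):
    """Unnormalized attention score between a query at position n and a key at position m."""
    q_rot = xpos_apply(q, n)
    k_rot = xpos_apply(k, m, is_key=True)
    return sum(a * b for a, b in zip(q_rot, k_rot))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.0, 0.4, -0.6, 0.9]
    # Same offset (7) at two different absolute positions.
    print(xpos_score(q, k, 10, 3))
    print(xpos_score(q, k, 110, 103))
```

The two printed scores agree up to floating-point error, because the score depends only on the offset n - m and not on the absolute positions. There is no learned position table tied to a fixed maximum length, which is what lets you pick the window size at runtime. How well quality holds up at longer lengths is a separate, empirical question that the papers above address.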