I'd like to know how it can support a 32k context when all the other models I've seen are at 2-4k. Does this mean it has a bigger attention layer, or is it simply 4x larger in billions of parameters?
It could be done in a dozen ways. One beautiful method is to use the xPos positional embedding pioneered by Microsoft and scale the context window size at runtime (even better if your attention is subquadratic; again, there are a dozen varieties to pick from). See:
"A Length-Extrapolatable Transformer"
https://arxiv.org/abs/2212.10554
"Language Is Not All You Need: Aligning Perception with Language Models"
https://arxiv.org/abs/2302.14045
Notably, this positional embedding has been implemented by lucidrains in his x-transformers package: https://github.com/lucidrains/x-transformers/blob/main/x_tra...
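To make the idea a bit more concrete, here is a rough pure-Python sketch of an xPos-style score between one query and one key. It is my own simplification, not the code from x-transformers or from the papers: the helper names (`xpos_rotate`, `xpos_score`) are hypothetical, and `gamma=0.4` and `scale_base=512` are assumed defaults. Each pair of dimensions is rotated by an angle proportional to the position (as in rotary embeddings) and additionally scaled by a per-pair decay factor, applied with opposite exponents to queries and keys.

```python
import math

def xpos_rotate(vec, pos, sign, gamma=0.4, scale_base=512.0):
    """Rotate each (even, odd) pair of `vec` by pos * theta_i and scale it by
    zeta_i ** (sign * pos / scale_base). Use sign=+1 for queries, -1 for keys.
    gamma and scale_base are assumed values, not taken from this thread."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = 10000.0 ** (-i / half)             # rotary frequency for pair i
        zeta = (i / half + gamma) / (1.0 + gamma)  # decay base in (0, 1)
        scale = zeta ** (sign * pos / scale_base)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [scale * (x * c - y * s), scale * (x * s + y * c)]
    return out

def xpos_score(q, k, q_pos, k_pos):
    """Unnormalized attention score between a query and a key."""
    q_rot = xpos_rotate(q, q_pos, +1)
    k_rot = xpos_rotate(k, k_pos, -1)
    return sum(a * b for a, b in zip(q_rot, k_rot))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]
near = xpos_score(q, k, 12, 10)        # offset of 2 near the start
far = xpos_score(q, k, 30012, 30010)   # same offset deep into a 32k window
print(near, far)
```

The two printed scores should agree up to floating-point error, because the rotations and the scales both cancel down to a function of the offset `q_pos - k_pos` alone. That is the property that makes it plausible to widen the window at runtime; how well model quality actually holds up at longer lengths is an empirical question this sketch does not address.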