by alekandreev
461 days ago
We never train at 128k, only at 32k, changing the scaling factor at the end. We wanted the long-context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5/1 ratio is close to that of a fully-global-layer model at 32k. Individual attention layers are always dense.
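To make the RAM comparison concrete, here is a rough back-of-the-envelope sketch. It only counts how many token positions the KV cache has to hold across layers, and it reads "5/1" as five sliding-window layers per global layer. The layer count (`NUM_LAYERS`) and window size (`WINDOW`) are hypothetical values chosen for illustration, not numbers from the comment above, and weights and activations are ignored entirely.

```python
def kv_cache_tokens(context_len, num_layers, local_per_global, window):
    """Count cached token positions summed over all layers.

    A global layer caches the full context; a local (sliding-window)
    layer caches at most `window` positions.
    """
    group = local_per_global + 1
    total = 0
    for layer in range(num_layers):
        # Assumed layout: the last layer of each group is the global one.
        is_global = (layer % group) == local_per_global
        total += context_len if is_global else min(context_len, window)
    return total


NUM_LAYERS = 48  # hypothetical layer count, for illustration only
WINDOW = 1024    # hypothetical sliding-window size, for illustration only

# 5 local layers per global layer, 128k context.
hybrid_128k = kv_cache_tokens(128 * 1024, NUM_LAYERS, 5, WINDOW)

# Every layer global (0 local layers per global layer), 32k context.
global_32k = kv_cache_tokens(32 * 1024, NUM_LAYERS, 0, WINDOW)

ratio = hybrid_128k / global_32k
print(f"hybrid @128k: {hybrid_128k} cached positions")
print(f"global @32k:  {global_32k} cached positions")
print(f"ratio:        {ratio:.2f}")
```

With these assumed values the hybrid cache at 128k comes out somewhat smaller than the all-global cache at 32k, which is in the same ballpark as the claim. The exact ratio depends heavily on the real window size and layer count.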
[Edit: You answered the question when you said that individual attention layers are always dense.]