So I can run a 4-bit quantization of the original Llama 8B on my laptop. It uses up pretty much all of my 6GB Nvidia card. Will I be able to run this model on the same laptop with an increased context window?
>We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on Crusoe Energy high performance L40S cluster.
>Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).
My understanding is that you need multiple GPUs to coordinate Ring Attention for the long context window.
> Note: using a 256k context window requires at least 64GB of memory.
If I run that 256k model with simple typed prompts it behaves the same as the normal version. But I have to be careful how much I stick in it. I only have 24GB in my GPU.
There doesn't seem to be any drawback to running the 256k version for small contexts though. That's pretty nice. The only thing is that it gets stuck when it runs out of memory (it just keeps twirling with the GPU pegged at 100%). With the regular model that won't happen, because it just gets amnesia and only remembers the last part of the context.
Can you select a context length that fits in your GPU though? I suppose even a 128k model would be more than enough for almost everyone running these models on their own hardware.
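For a rough idea of what context length fits, here is a back-of-the-envelope sketch of KV cache size. It assumes the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) and an fp16 cache. The 5 GiB figure for 4-bit weights and the helper names `kv_cache_bytes` and `max_context_tokens` are my own placeholders, not anything from the model card. It ignores activations and runtime overhead, so real limits will be lower.

```python
GiB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a given number of tokens.

    Defaults are the assumed Llama 3 8B shape with an fp16 (2-byte) cache.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

def max_context_tokens(vram_bytes, weight_bytes, **arch):
    """Largest token count whose KV cache fits in the memory left after weights."""
    free = vram_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // kv_cache_bytes(1, **arch)

if __name__ == "__main__":
    # Cache cost per token, and for a full 256k (262,144 token) window.
    print(kv_cache_bytes(1), "bytes per token")
    print(kv_cache_bytes(256 * 1024) / GiB, "GiB for 256k tokens")

    # Hypothetical budgets: 24 GiB and 6 GiB cards, ~5 GiB of 4-bit weights (assumed).
    print(max_context_tokens(24 * GiB, 5 * GiB), "tokens on a 24 GiB card")
    print(max_context_tokens(6 * GiB, 5 * GiB), "tokens on a 6 GiB card")
```

Under these assumptions the cache alone is about 32 GiB at 256k tokens, which is roughly in line with the 64GB note once fp16 weights and overhead are added. Quantizing the cache (a smaller `bytes_per_elem`) would stretch the budget further, if your runtime supports it.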