by newswasboring 778 days ago
So I can run a 4-bit quantization of the original Llama 8B on my laptop. It pretty much uses up all of my 6GB Nvidia card. Will I be able to run this model on the same laptop with an increased context window?
4 comments

From https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradie...

>We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on Crusoe Energy high performance L40S cluster.

>Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).

My understanding is you need multiple GPUs to coordinate ring-attention for the long context window.

Nope. You can run the model fine, but if you actually want to take advantage of the big context window, the memory usage will grow enormously.

For the 256k they already require 64GB... So for this I guess 256GB?

Source: https://ollama.com/library/dolphin-llama3:256k

> Note: using a 256k context window requires at least 64GB of memory.
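For a rough sense of where that memory goes, here is a back-of-the-envelope sketch of the KV cache size alone. It assumes the published Llama 3 8B shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an unquantized fp16 cache; `kv_cache_bytes` is a made-up helper name, not part of ollama or any library. Weights and compute buffers come on top of this, so real totals will be higher.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for a given context length.

    Defaults are assumptions matching Llama 3 8B with an fp16 cache.
    The leading 2 accounts for storing both keys and values.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens

if __name__ == "__main__":
    # Context lengths: stock 8k, the 262k variant, and the 1048k variant.
    for ctx in (8_192, 262_144, 1_048_576):
        print(f"{ctx:>9} tokens: {kv_cache_bytes(ctx) / GIB:.0f} GiB of KV cache")
```

Under those assumptions the cache works out to 128 KiB per token, so it scales linearly: about 1 GiB at 8k, 32 GiB at 262k, and 128 GiB at 1048k. A quantized KV cache would shrink these numbers proportionally.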

If I run that 256k model with simple typed prompts it behaves the same as the normal version. But I have to be careful how much I stick in it. I only have 24GB in my GPU.

There doesn't seem to be any drawback to running the 256k version for small contexts though. That's pretty nice. The only thing is that it will get stuck when it runs out of memory (it just keeps twirling with the GPU pegged at 100%). With the regular model that won't happen because it will just get amnesia and only remember the last part of the context.

Can you select a context length that fits in your GPU though? I suppose even a 128k model would be more than enough for almost everyone running these models on their own hardware.
No, you can't right now. Hopefully they will add this to ollama.
256k (actually 262k) is also up on HF: https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k
No, you can't. The KV buffer and the compute buffer both grow as the context window grows.
Probably, yeah.
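On the question of what context length would actually fit on a given card, a rough sketch of the inverse calculation: subtract the weights and some buffer overhead from the memory budget, then divide by the per-token KV cache cost. The per-token figure assumes Llama 3 8B (32 layers, 8 KV heads, head dimension 128) with an fp16 cache. The weight and overhead sizes in the example are placeholder guesses, not measurements, and `max_context_tokens` is a hypothetical helper.

```python
GIB = 1024 ** 3

# Assumed per-token KV cost for Llama 3 8B with an fp16 cache:
# 2 (K and V) * 32 layers * 8 KV heads * 128 head dim * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2

def max_context_tokens(budget_bytes, weights_bytes, overhead_bytes=0,
                       bytes_per_token=KV_BYTES_PER_TOKEN):
    """Largest context whose KV cache fits in what is left of the budget."""
    free = budget_bytes - weights_bytes - overhead_bytes
    return max(0, free // bytes_per_token)

if __name__ == "__main__":
    # Placeholder numbers: ~5 GiB of quantized weights, ~1 GiB of compute buffers.
    for gpu_gib in (6, 24):
        tokens = max_context_tokens(gpu_gib * GIB, 5 * GIB, 1 * GIB)
        print(f"{gpu_gib} GiB card: room for roughly {tokens} tokens of context")
```

With those placeholder numbers a 24 GiB card has room for roughly 147k tokens, while a 6 GiB card that is already full of weights has essentially nothing left over for a longer context.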