If you're pushing the boundaries of AI/ML and need compute to help get there, shoot us an email at sponsorship at crusoecloud dot com and we'd love to chat!
In practice all of these long contexts show degraded performance (there's a table on the repo). For my NLP work I find that GPT-4-turbo is much worse after 32k-ish.
Hi, Leo, chief scientist @ Gradient, here. We've been eagerly awaiting the release of RULER's code ourselves! As mentioned below, we wanted to release a model to the community asap, and have plans already for further fine-tuning & more sophisticated evals.
If you have other suggestions, I'd be happy to chat further.
So I can run a 4bit quantization of original Llama 8B on my laptop. It pretty much uses up all of my 6GB Nvidia card. Will I be able to just run this model on the same laptop with an increased context window?
>We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on Crusoe Energy high performance L40S cluster.
>Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).
My understanding is you need multiple GPUs to coordinate ring-attention for the long context window.
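For intuition on why several devices get involved, here is a toy single-process simulation of the ring idea. It is only a sketch and not the EasyContext code: the function names are made up for illustration, there is no causal mask, and the "devices" are just list indices. Each device keeps its own query block, the KV blocks rotate around the ring one hop per step, and a running (online) softmax lets each device fold in one KV block at a time without ever holding the full sequence.

```python
import math
import random

def full_attention(q, keys, values):
    """Reference: ordinary softmax attention for one query over all keys."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate n devices. Device i owns q_blocks[i] and starts with KV block i.
    At every step each device attends to the KV block it currently holds,
    then the KV blocks shift one position around the ring."""
    n = len(q_blocks)
    d = len(q_blocks[0][0])
    dv = len(v_blocks[0][0])
    # Per-query running state: (running max, normalizer, unnormalized output).
    state = [[(-math.inf, 0.0, [0.0] * dv) for _ in qb] for qb in q_blocks]
    for step in range(n):
        for dev in range(n):
            src = (dev - step) % n  # index of the KV block now sitting on dev
            for qi, q in enumerate(q_blocks[dev]):
                m, z, acc = state[dev][qi]
                for k, v in zip(k_blocks[src], v_blocks[src]):
                    s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                    new_m = max(m, s)
                    scale = math.exp(m - new_m)  # rescale old partial sums
                    w = math.exp(s - new_m)
                    z = z * scale + w
                    acc = [a * scale + w * x for a, x in zip(acc, v)]
                    m = new_m
                state[dev][qi] = (m, z, acc)
    return [[[a / z for a in acc] for (_, z, acc) in dev_state]
            for dev_state in state]

# Tiny demo: 4 simulated devices, 3 tokens per block, head dimension 8.
random.seed(0)
N_DEV, BLOCK, DIM = 4, 3, 8

def rand_block():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BLOCK)]

Q = [rand_block() for _ in range(N_DEV)]
K = [rand_block() for _ in range(N_DEV)]
V = [rand_block() for _ in range(N_DEV)]
ring_out = ring_attention(Q, K, V)
```

The point is that no device ever needs more than one KV block in memory at a time, which is what lets the sequence length scale with the number of devices. The price is the block passing between devices that the quoted model card calls a network bottleneck.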
> Note: using a 256k context window requires at least 64GB of memory.
If I run that 256k model with simple typed prompts it behaves the same as the normal version. But I have to be careful how much I stick in it. I only have 24GB in my GPU.
There doesn't seem to be any drawback running the 256k version for small contexts though. That's pretty nice. The only thing is that it will get stuck when it runs out of memory (it just keeps twirling with the GPU pegged at 100%). With the regular model that won't happen because it will just get amnesia and just remember the last part of the context.
Can you select a context length that fits in your GPU, though? I suppose even a 128k model would be more than enough for almost everyone running these models on their own hardware.
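A rough way to answer that is to estimate the KV cache, which is usually what eats the memory at long context. This is a back-of-the-envelope sketch, assuming the published Llama 3 8B shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache. The helper names and the 5 GiB weight figure for a 4-bit quant are my own assumptions, and activations and framework overhead are ignored.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens.
    The leading 2 counts keys and values separately."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

def max_tokens_that_fit(vram_bytes, weight_bytes):
    """Largest context whose KV cache fits in what is left after the weights.
    Ignores activations and runtime overhead, so treat it as an upper bound."""
    free = vram_bytes - weight_bytes
    return max(0, free // kv_cache_bytes(1))

for ctx in (8_192, 131_072, 262_144, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / GIB:6.1f} GiB of KV cache")

# Hypothetical 24 GiB card with roughly 5 GiB of 4-bit weights loaded.
print(max_tokens_that_fit(24 * GIB, 5 * GIB), "tokens fit at most")
```

Under those assumptions the cache costs 128 KiB per token, so 256k tokens is about 32 GiB for the cache alone. That is at least in the same ballpark as the 64GB note once fp16 weights and overhead are added. A quantized KV cache would shrink these numbers.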
> We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.
This isn't training on top of existing weights from Llama-3; it's training on their own long-context data, and it's such a tiny set that I wonder how strong its reasoning capability is.
We are training on top of llama 3. The 256k reasoning benchmarks are on the open LLM leaderboard.
And re: token count: our copy was wrong -- it's pre-prepped copy for a model run that didn't pan out. Updating to correct number -- already present in the training grid further down in the model card. Bit over 830M tokens for this stage and >1B for all extension stages combined.
Your point re: token counts still stands. We wanted to get something out asap and finetune more later. I believe the giant vocab size of llama 3 is actually adversarial for finetunes. You need a beefy dataset to even hit all vocab tokens a single time with a forward and backward.
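To put the token counts in this thread side by side, here is the arithmetic as a quick sketch. The 15T denominator is an assumption taken from Meta's "over 15T tokens" figure for Llama 3 pre-training, so the real percentages are slightly lower than what this prints.

```python
# Assumed size of Llama 3's pre-training corpus (Meta says "over 15T tokens").
PRETRAIN_TOKENS = 15e12

def percent_of_pretraining(tokens):
    """Fine-tuning token count as a percentage of the assumed pre-training set."""
    return 100 * tokens / PRETRAIN_TOKENS

counts = {
    "originally quoted (320M)": 320e6,
    "corrected, this stage (830M)": 830e6,
    "all extension stages (1B, lower bound)": 1e9,
}
for label, tokens in counts.items():
    print(f"{label}: {percent_of_pretraining(tokens):.4f}%")
```

Either way the fine-tuning data is well under a hundredth of a percent of pre-training, so the concern about how little data this is holds with the corrected numbers too.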
8B, and it got better this morning: they merged in flash attention, so I can now load almost 500k tokens (with 96GB of VRAM). That said, you could plausibly have this kind of resource yourself; this is a cheap build, a mixture of old and used GPUs.
As it stands, Llama 3 can't even produce sensible responses approaching its quoted max context, but if this increases the window we have before nonsense begins, then that is awesome!