Gradient AI Releases 1M Context Llama 3 8B (twitter.com)
80 points by forrestp 778 days ago
10 comments

Mike @ Crusoe AI (crusoe.ai/cloud) here: we're a compute sponsor for this effort and have been working with other fine-tunes (e.g. https://x.com/erhartford/status/1783273948022755770, https://x.com/winglian/status/1778777261568606703).

If you're pushing the boundaries of AI/ML and need compute to help get there, shoot us an email at sponsorship at crusoecloud dot com and we'd love to chat!

I still don’t understand what you guys do with so many products. Can you elaborate on Crusoe Cloud and the whole company in general in simple terms?
Everything (training, evals, inference) was performed on their L40S clusters. These machines are underrated but capable of serious work.
Can someone test this with RULER, please? https://github.com/hsiehjackson/RULER

In practice all of these long contexts show degraded performance (there's a table on the repo). For my NLP work I find that GPT-4-turbo is much worse after 32k-ish.

Hi, Leo, chief scientist @ Gradient, here. We've been eagerly awaiting the release of RULER's code ourselves! As mentioned below, we wanted to release a model to the community asap, and have plans already for further fine-tuning & more sophisticated evals.

If you have other suggestions, I'd be happy to chat further.

Hi!

Unless I'm missing something, they did add the eval scripts to that repo 4 days ago.

Waiting until 4 days ago =)
So I can run a 4bit quantization of original Llama 8B on my laptop. It pretty much uses up all of my 6GB Nvidia card. Will I be able to just run this model on the same laptop with an increased context window?
From https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradie...

>We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on Crusoe Energy high performance L40S cluster.

>Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).

My understanding is you need multiple GPUs to coordinate ring-attention for the long context window.
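For anyone unfamiliar with the ring attention mentioned in the quoted model card, here is a toy single-process sketch of the idea. It is not the EasyContext implementation: the function names are made up for illustration, there is a single head and no causal mask, and the "devices" are just list indices. Each device keeps its own block of queries, sees one block of keys/values per step, and then hands that block to its neighbour; an online softmax keeps the running result exact.

```python
import math
import random


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / total
                    for c in range(len(v[0]))])
    return out


def ring_attention(q, k, v, n_devices):
    """Simulated ring attention (hypothetical helper, illustration only)."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    assert n % n_devices == 0, "sequence must split evenly across devices"
    blk = n // n_devices
    q_blocks = [q[i * blk:(i + 1) * blk] for i in range(n_devices)]
    kv_blocks = [(k[i * blk:(i + 1) * blk], v[i * blk:(i + 1) * blk])
                 for i in range(n_devices)]
    # Online-softmax state per query: (running max, normaliser, weighted sum).
    state = [[(-math.inf, 0.0, [0.0] * dv) for _ in range(blk)]
             for _ in range(n_devices)]

    for _ in range(n_devices):
        for dev in range(n_devices):
            k_blk, v_blk = kv_blocks[dev]  # the only KV block this device holds
            for i, qi in enumerate(q_blocks[dev]):
                m, l, acc = state[dev][i]
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k_blk]
                m_new = max(m, max(scores))
                scale = math.exp(m - m_new)  # rescale earlier partial results
                w = [math.exp(s - m_new) for s in scores]
                l = l * scale + sum(w)
                acc = [a * scale + sum(wj * vj[c] for wj, vj in zip(w, v_blk))
                       for c, a in enumerate(acc)]
                state[dev][i] = (m_new, l, acc)
        # Pass every KV block to the next device in the ring.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    return [[a / l for a in acc] for dev_state in state for (_, l, acc) in dev_state]


random.seed(0)
n, d = 8, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = full_attention(q, k, v)
ring = ring_attention(q, k, v, n_devices=4)
max_err = max(abs(a - b) for r1, r2 in zip(ref, ring) for a, b in zip(r1, r2))
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

In this sketch no device ever holds more than one KV block, so per-device memory scales with the sequence length divided by the number of devices. The price is the block passing at the end of each step, which is the network traffic the model card describes as a bottleneck.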

Nope. You can run the model fine but if you actually want to take advantage of the big context window the memory usage will grow enormously.

For the 256k they already require 64GB... So for this I guess 256GB?

Source: https://ollama.com/library/dolphin-llama3:256k

> Note: using a 256k context window requires at least 64GB of memory.

If I run that 256k model with simple typed prompts it behaves the same as the normal version. But I have to be careful how much I stick in it. I only have 24GB in my GPU.

There doesn't seem to be any drawback running the 256k version for small contexts though. That's pretty nice. The only thing is that it will get stuck when it runs out of memory (it just keeps twirling with the GPU pegged at 100%). With the regular model that won't happen because it will just get amnesia and just remember the last part of the context.

Can you select a context length that fits in your GPU, though? I suppose even a 128k model would be more than enough for almost everyone running these models on their own hardware.
No you can't right now. Hopefully they will add this to ollama.
256k (actually 262k) is also up on HF: https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k
No, you can't: the KV buffer and the compute buffer both grow as the context window grows.
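To put rough numbers on the memory growth discussed in this subthread, here is a back-of-the-envelope sketch of the KV cache alone. It assumes the published Llama 3 8B shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an unquantized fp16 cache; the function name is made up for illustration. Model weights and compute buffers come on top, and a quantized KV cache would shrink these figures.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Estimate KV cache size in bytes (assumed Llama 3 8B shape, fp16 cache)."""
    # Factor 2: one key vector and one value vector per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


for label, n_tokens in [("8k", 8192), ("256k", 262144), ("1M", 1048576)]:
    gib = kv_cache_bytes(n_tokens) / 2**30
    print(f"{label:>4} context: {gib:.0f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly with context: about 1 GiB at 8k, 32 GiB at 256k, and 128 GiB at 1M tokens.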
Probably, yeah.
> We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.

This isn't training on top of existing weights from Llama-3; it's training on their own long-context data, and it's such a tiny set that I wonder how strong its reasoning capability is.

We are training on top of llama 3. The 256k reasoning benchmarks are on the open LLM leaderboard.

And re: token count: our copy was wrong -- it's pre-prepped copy for a model run that didn't pan out. Updating to correct number -- already present in the training grid further down in the model card. Bit over 830M tokens for this stage and >1B for all extension stages combined.

Your point re: token counts still stands. We wanted to get something out asap and finetune more later. I believe the giant vocab size of llama 3 is actually adversarial for finetunes. You need a beefy dataset to even hit all vocab tokens a single time with a forward and backward.
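For scale, the token counts in this subthread can be compared with Llama 3's pre-training corpus. This is only a rough sketch: the 15T figure is an assumption taken from Meta's public statement that Llama 3 was pre-trained on "over 15T tokens", and ">1B" is rounded down to 1B, so the percentages are approximate.

```python
PRETRAIN_TOKENS = 15e12  # assumption: Meta's stated "over 15T tokens"


def fraction_of_pretraining(tokens):
    """Fine-tuning tokens as a fraction of the assumed pre-training corpus."""
    return tokens / PRETRAIN_TOKENS


stages = [
    ("320M (figure in the original copy)", 320e6),
    ("830M (this stage, per the correction)", 830e6),
    ("1B (all extension stages, lower bound)", 1e9),
]
for label, tokens in stages:
    print(f"{label}: {fraction_of_pretraining(tokens):.4%}")
```

Whichever figure is used, the fine-tuning data stays below a hundredth of a percent of the assumed pre-training corpus, which is why the point about token counts still stands.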

The table at the bottom says they initialized the 65K version from "LLaMA-3 7B"? (Assuming the 7B is a typo and they meant 8B.)

And each successive version with a larger window was initialized on the previous smaller one (65K -> 262K -> 524k -> 1048k).

Right. We are sleep deprived -- couldn't stop over the weekend. Please forgive the typos
Wow there was already a 256k version (dolphin). 1M is insane. Be aware you need a lot of memory though
With 144GB of GPU memory, the most I can load for Llama 3 is 232k.
Which llama3 is that? 8b or 70b? And what kind of quantisation?

Just wondering. I'll never have that kind of resources (well not in the next 5 years) but just trying to put it into perspective..

8B, and it got better this morning: they merged in flash attention, so I can now load almost 500k tokens (with 96GB of VRAM). That said, you could possibly have this kind of resource; this is a cheap build, a mixture of old and used GPUs.
Will you also do a 70B finetune? Thanks.
Yes - stay tuned for long context 70B
Are there any higher-context 70B finetunes yet with good benchmarks?
As it stands, Llama 3 can't even produce sensible responses approaching its quoted max context, but if this increases the window we have before the nonsense begins, then that is awesome!
Looking forward to a serving system that can actually use this!
Pretty cool, but no one can run 1M context at home.