Gradient AI Releases 1M Context Llama 3 8B

	Gradient AI Releases 1M Context Llama 3 8B (twitter.com)
	80 points by forrestp 778 days ago

10 comments

asciimike 778 days ago

Mike @ Crusoe AI (crusoe.ai/cloud) here: we're a compute sponsor for this effort and have been working with other fine-tunes (e.g. https://x.com/erhartford/status/1783273948022755770, https://x.com/winglian/status/1778777261568606703).

If you're pushing the boundaries of AI/ML and need compute to help get there, shoot us an email at sponsorship at crusoecloud dot com and we'd love to chat!

blossompeach 768 days ago

I still don’t understand what you guys do with so many products. Can you elaborate on Crusoe Cloud and the whole company in general in simple terms?

forrestp 778 days ago

Everything (training, evals, inference) was performed on their L40S clusters. These machines are underrated but capable of serious work.

msp26 778 days ago

Can someone test this with RULER, please? https://github.com/hsiehjackson/RULER

In practice all of these long contexts show degraded performance (there's a table on the repo). For my NLP work I find that GPT-4-turbo is much worse after 32k-ish.

leonid_pekelis 778 days ago

Hi, Leo, chief scientist @ Gradient, here. We've been eagerly awaiting the release of RULER's code ourselves! As mentioned below, we wanted to release a model to the community asap, and have plans already for further fine-tuning & more sophisticated evals.

If you have other suggestions, I'd be happy to chat further.

msp26 778 days ago

Hi!

Unless I'm missing something, they did add the eval scripts to that repo 4 days ago.

leonid_pekelis 778 days ago

Waiting until 4 days ago =)

newswasboring 778 days ago

So I can run a 4-bit quantization of the original Llama 8B on my laptop. It pretty much uses up all of my 6GB Nvidia card. Will I be able to run this model on the same laptop with an increased context window?

whymauri 778 days ago

From https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradie...

>We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on Crusoe Energy high performance L40S cluster.

>Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).

My understanding is you need multiple GPUs to coordinate ring-attention for the long context window.
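
For readers unfamiliar with the idea, here is a minimal single-process sketch of the blockwise accumulation that Ring Attention builds on: the keys and values are visited one block at a time and a running softmax keeps the result identical to ordinary attention. It is not the EasyContext API and not Gradient's training code; the function names and the toy vectors are made up for illustration, and the real thing passes KV blocks between devices rather than looping on one.

```python
import math

def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))

def full_attention(q, keys, values):
    """Reference: softmax over all keys at once, then a weighted sum of values."""
    scores = [score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def blockwise_attention(q, keys, values, block_size):
    """Same result, but only one KV block is looked at per step."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [score(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (hypothetical): one query, five key/value pairs, blocks of two.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(full_attention(q, keys, values))
print(blockwise_attention(q, keys, values, block_size=2))
```

The two printed vectors should agree up to floating-point rounding. Because each step only needs one KV block plus a few running values, the blocks can live on different devices, which is what makes the multi-GPU setup described in the model card possible.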

wkat4242 778 days ago

Nope. You can run the model fine but if you actually want to take advantage of the big context window the memory usage will grow enormously.

For the 256k they already require 64GB... So for this I guess 256GB?

Source: https://ollama.com/library/dolphin-llama3:256k

> Note: using a 256k context window requires at least 64GB of memory.

If I run that 256k model with simple typed prompts it behaves the same as the normal version. But I have to be careful how much I stick in it. I only have 24GB in my GPU.

There doesn't seem to be any drawback running the 256k version for small contexts though. That's pretty nice. The only thing is that it will get stuck when it runs out of memory (it just keeps twirling with the GPU pegged at 100%). With the regular model that won't happen because it will just get amnesia and just remember the last part of the context.
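
A back-of-the-envelope sketch of why memory grows with context: the KV cache stores one key and one value vector per layer for every token. The defaults below are assumptions taken from the publicly posted Llama 3 8B configuration (32 layers, 8 KV heads, head dimension 128) with 16-bit cache entries; `kv_cache_bytes` is a name made up for this example, and model weights and compute buffers come on top of what it reports.

```python
GIB = 1024 ** 3

def kv_cache_bytes(context_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size; all defaults are assumptions (see above)."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token

for label, tokens in [("8k", 8_192), ("262k", 262_144), ("1048k", 1_048_576)]:
    print(f"{label:>6}: {kv_cache_bytes(tokens) / GIB:.1f} GiB of KV cache")
```

Under these assumptions the cache alone is 1 GiB at 8,192 tokens, 32 GiB at 262,144 tokens and 128 GiB at 1,048,576 tokens. The growth is linear, which is consistent with scaling the 64GB figure for 256k by four to guess at 1M.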

throwaway4aday 778 days ago

Can you select a context length that fits in your GPU, though? I suppose even a 128k model would be more than enough for almost everyone running these models on their own hardware.
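
As a rough way to reason about which context length a given card could hold, the sketch below turns a memory budget into a token count. Everything in it is an assumption for illustration: the function name, the 131,072 bytes of KV cache per token (a 16-bit cache with 32 layers, 8 KV heads and head dimension 128), and the example weight and overhead sizes. Whether a particular runtime lets you set the context length is a separate question.

```python
GIB = 1024 ** 3

def max_context_tokens(gpu_bytes, weight_bytes,
                       per_token_kv_bytes=131_072, overhead_bytes=0):
    """Largest token count whose KV cache fits after weights and overhead."""
    free = gpu_bytes - weight_bytes - overhead_bytes
    # A budget smaller than the weights leaves room for no tokens at all.
    return max(free, 0) // per_token_kv_bytes

# Hypothetical setup: 24 GiB card, ~5 GiB of quantized weights,
# 1 GiB set aside for compute buffers.
tokens = max_context_tokens(24 * GIB, 5 * GIB, overhead_bytes=1 * GIB)
print(f"roughly {tokens:,} tokens of context fit")
```

With these made-up numbers the answer is 147,456 tokens, so a 128k window would fit on such a card while 256k would not.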

wkat4242 778 days ago

No, you can't right now. Hopefully they will add this to ollama.

ml2 778 days ago

256k (actually 262k) is also up on HF: https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k
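
The "256k versus 262k" naming is the usual binary-versus-decimal mismatch. A small sketch, assuming each stage's context length is an exact power of two (the thread does not state the exact token counts; the helper names are made up):

```python
def binary_label(tokens):
    """Label in units of 1,024 tokens, e.g. 256k."""
    return f"{tokens // 1024}k"

def decimal_label(tokens):
    """Label in units of 1,000 tokens, e.g. 262k."""
    return f"{tokens // 1000}k"

for exponent in (16, 18, 19, 20):
    tokens = 2 ** exponent
    print(f"2^{exponent} = {tokens:>9,} tokens: "
          f"{binary_label(tokens)} (binary) or {decimal_label(tokens)} (decimal)")
```

Under that assumption the decimal labels come out as 65k, 262k, 524k and 1048k, matching the names used for the successive models in this thread.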

segmondy 778 days ago

No, you can't. The KV buffer and the compute buffer both grow as the context window grows.

londons_explore 778 days ago

Probably, yeah.

donsupreme 778 days ago

> We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.

This isn't training on top of existing weights from Llama-3, it's training using their own long context data, and it's such a tiny set that I'm wondering how strong its reasoning capability is.

forrestp 778 days ago

We are training on top of llama 3. The 256k reasoning benchmarks are on the open LLM leaderboard.

And re: token count: our copy was wrong -- it's pre-prepped copy for a model run that didn't pan out. Updating to correct number -- already present in the training grid further down in the model card. Bit over 830M tokens for this stage and >1B for all extension stages combined.

Your point re: token counts still stands. We wanted to get something out asap and finetune more later. I believe the giant vocab size of llama 3 is actually adversarial for finetunes. You need a beefy dataset to even hit all vocab tokens a single time with a forward and backward.
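
One way to check the vocabulary point on a given dataset is to count how many token ids ever appear in it. This is a minimal sketch, not Gradient's tooling: `vocab_coverage` is a made-up helper, and the toy ids stand in for the output of a real tokenizer (Llama 3's published vocabulary has 128,256 entries).

```python
from collections import Counter

def vocab_coverage(token_ids, vocab_size):
    """Return (fraction of vocabulary seen at least once, number never seen)."""
    counts = Counter(token_ids)
    # Ignore ids outside the vocabulary rather than counting them as seen.
    seen = sum(1 for token_id in counts if 0 <= token_id < vocab_size)
    return seen / vocab_size, vocab_size - seen

# Toy example with a hypothetical 10-entry vocabulary.
fraction, unseen = vocab_coverage([0, 1, 1, 2, 5, 5, 5], vocab_size=10)
print(f"{fraction:.0%} of the vocabulary seen, {unseen} ids never seen")
```

Any id that never appears contributes no gradient signal to its embedding row during the fine-tune, which is why a large vocabulary needs a correspondingly large dataset to touch every entry.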

zackangelo 778 days ago

The table at the bottom says they initialized the 65K version from "LLaMA-3 7B"? (Assuming the 7B is a typo and they meant 8B.)

And each successive version with a larger window was initialized on the previous smaller one (65K -> 262K -> 524k -> 1048k).

forrestp 778 days ago

Right. We are sleep deprived -- couldn't stop over the weekend. Please forgive the typos.

wkat4242 778 days ago

Wow, there was already a 256k version (dolphin). 1M is insane. Be aware that you need a lot of memory, though.

segmondy 778 days ago

With 144GB of GPU memory, the most I can load for Llama 3 is 232k.

wkat4242 778 days ago

Which llama3 is that? 8b or 70b? And what kind of quantisation?

Just wondering. I'll never have those kinds of resources (well, not in the next 5 years), but I'm trying to put it into perspective.

segmondy 777 days ago

8B, and it got better this morning: they merged in flash attention, so I can now load almost 500k tokens with 96GB of VRAM. That said, you can possibly have this kind of resource. This is a cheap build, a mixture of old and used GPUs.

kristianp 778 days ago

Will you also do a 70B finetune? Thanks.

ml2 778 days ago

Yes - stay tuned for long context 70B

eshack94 774 days ago

Are there any higher-context 70B finetunes yet with good benchmarks?

Grimblewald 775 days ago

As it stands, Llama 3 can't even produce sensible responses when approaching its quoted max context, but if this increases the window we have before the nonsense begins, then that is awesome!

mich5632 777 days ago

Looking forward to a serving system that can actually use this!

segmondy 778 days ago

Pretty cool, but no one can run 1M context at home.
