by wkat4242 779 days ago
Nope. You can run the model fine, but if you actually want to take advantage of the big context window, the memory usage will grow enormously.

For the 256k version they already require 64GB... So for this one I guess 256GB?

Source: https://ollama.com/library/dolphin-llama3:256k

> Note: using a 256k context window requires at least 64GB of memory.
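
For a rough sense of why memory grows with context, here is a back-of-the-envelope sketch of the KV cache size. It assumes the Llama 3 8B architecture (32 layers, 8 KV heads, head dimension 128) and an fp16 cache, which is my assumption and not something the Ollama page states. The 1,048,576-token row is only a hypothetical larger window for comparison. It counts the cache alone, not the weights or runtime overhead, so it won't line up exactly with the 64GB note.

```python
# Rough KV-cache size estimate (cache only, no weights or runtime overhead).
# Architecture numbers are for Llama 3 8B: 32 layers, 8 KV heads, head dim 128.
# Assumption: the cache is stored in fp16 (2 bytes per value).

GIB = 1024 ** 3


def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    # Factor 2: one key vector and one value vector per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


for ctx in (8_192, 262_144, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_bytes(ctx) / GIB:6.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so it scales linearly: about 1 GiB at 8k tokens and 32 GiB at 262,144 tokens.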

If I run that 256k model with simple typed prompts, it behaves the same as the normal version. But I have to be careful how much I put into it. I only have 24GB on my GPU.

There doesn't seem to be any drawback to running the 256k version for small contexts, though. That's pretty nice. The only catch is that it gets stuck when it runs out of memory (it just keeps spinning with the GPU pegged at 100%). With the regular model that won't happen, because it just gets amnesia and only remembers the last part of the context.

Can you select a context length that fits in your GPU, though? I suppose even a 128k model would be more than enough for almost everyone running these models on their own hardware.
No, you can't right now. Hopefully they will add this to ollama.
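
To get a feel for what context length would fit on a given card, here is a minimal sketch that inverts the KV cache arithmetic. The 128 KiB per token figure assumes Llama 3 8B (32 layers, 8 KV heads, head dimension 128) with an fp16 cache, and the 5 GiB weight size is a hypothetical stand-in for a quantized 8B model, not a measured number. `max_context_tokens` is an illustrative helper, not part of any tool.

```python
# Estimate the largest context whose KV cache fits next to the weights.
# Assumption: 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 2 bytes (fp16)
# = 131072 bytes of cache per token for Llama 3 8B.

GIB = 1024 ** 3
PER_TOKEN_BYTES = 2 * 32 * 8 * 128 * 2


def max_context_tokens(total_bytes, weights_bytes,
                       per_token_bytes=PER_TOKEN_BYTES):
    # Whatever memory the weights leave free is available for the cache.
    free = max(total_bytes - weights_bytes, 0)
    return free // per_token_bytes


# 24 GiB card, hypothetical 5 GiB of quantized weights.
print(max_context_tokens(24 * GIB, 5 * GIB))
```

This ignores activation buffers and other runtime overhead, so the real limit would be somewhat lower than the number it prints.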
256k (actually 262k) is also up on HF: https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k