Hacker News
by wkat4242 785 days ago
Wow, there was already a 256k version (dolphin). 1M is insane. Be aware you need a lot of memory, though.

With 144 GB of GPU memory, the most I can load for llama3 is 232k tokens.
Which llama3 is that? 8b or 70b? And what kind of quantisation?

Just wondering. I'll never have that kind of resources (well, not in the next 5 years), but I'm just trying to put it into perspective.

8B, and it got better this morning: they merged in flash attention, so I can now load almost 500k tokens with 96 GB of VRAM. That said, you could possibly have this kind of resource; this is a cheap build, a mixture of old and used GPUs.
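
To put those numbers into perspective, here is a rough back-of-the-envelope sketch of the KV cache size alone. It assumes the published Llama 3 8B shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an unquantized 16-bit cache. The function name `kv_cache_bytes` is made up for illustration, and the estimate ignores the model weights, activations, and any scratch buffers the loader allocates, so real usage will be higher.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for a given context length.

    Defaults are assumptions: Llama 3 8B shape with a 16-bit (fp16/bf16) cache.
    """
    # Leading 2 = one key tensor plus one value tensor stored per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    for tokens in (232_000, 500_000, 1_000_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>9,} tokens -> {gib:6.1f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly with context length. The gap between this estimate and the memory figures quoted above would then come from everything the sketch leaves out, which probably includes the attention buffers that flash attention avoids materializing.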