Hacker News new | ask | show | jobs
by segmondy 778 days ago
With 144gb of GPU memory, The most I can load for llama3 is 232k.
1 comments

Which llama3 is that? 8b or 70b? And what kind of quantisation?

Just wondering. I'll never have that kind of resources (well not in the next 5 years) but just trying to put it into perspective..

8B, and it got better this morning, they merged in flash attention so I can now load almost 500k tokens with (96gb of vram) With that said, you can possibly have this kind of resource, this is a cheap build. Mixture of old and used GPUs.