Hacker News new | ask | show | jobs
by SirMaster 321 days ago
You don't really need it to fit all in VRAM due to the efficient MoE architecture and with llama.cpp

The 120B is running at 20 tokens/sec on my 5060Ti 16GB with 64GB of system ram. Now personally I find 20 tokens/sec quite usable, but for some maybe it's not enough.

1 comments

I have a similar setup but with 32 GB of RAM. Do you partly offload the model to RAM? Do you use LMStudio or other to achieve this? Thanks!
Yes, LMStudio and it automatically does this.