You can already do that with GGUF/GGML models, which let you split the model between CPU and GPU. Obviously there is a performance hit when running on your DDR5 and CPU compared to HBM/GDDR and GPU, but it's better than nothing.
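For a rough sense of how the split works: llama.cpp-style runtimes let you choose how many transformer layers go on the GPU (the `-ngl` / `--n-gpu-layers` option in llama.cpp), and the rest stay in system RAM. Below is a minimal sketch of picking that number. It assumes all layers are about the same size and reserves a fixed overhead for the KV cache and scratch buffers. The function name, the 2 GB overhead, and the 40 GB file size are illustrative assumptions, not measured values.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, overhead_gb=2.0):
    """Estimate how many equally sized layers can be offloaded to the GPU.

    overhead_gb is an assumed reserve for KV cache and scratch buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical example: a ~40 GB quantized file with 80 layers, 24 GB of VRAM.
total_layers = 80
gpu_layers = layers_that_fit(model_size_gb=40.0, n_layers=total_layers, vram_gb=24.0)
print(f"{gpu_layers} layers on GPU, {total_layers - gpu_layers} on CPU")
```

In practice you would start from an estimate like this and lower the layer count if the runtime reports out-of-memory errors.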
I have not been keeping up with developments. Does this mean mortals can run the biggest tier of Llama models (albeit with trash performance) by using system RAM? For playing around, I would be willing to let my system chug along just to see what the top tier models can achieve.
Technically yes - if you have lots of RAM you can use that and your CPU. As you say, though, the performance would be pretty poor, especially as it's a tool where you want to tweak your responses quite frequently. I've been running an old Nvidia Tesla P100 card that I got cheap on eBay for a while now. It has 16 GB of VRAM, but it is pretty old. I'm so interested in this now that I've gone out and got myself a secondhand RTX 3090 - something I never thought I'd do, but I'd really like to run 30B models on the GPU.
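Whether a model fits is mostly back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. Here is a rough sketch that counts weights only and ignores the KV cache and runtime overhead, so real usage will be higher. Real quantized formats also use slightly more than their nominal bit width. The helper name is made up for illustration.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


cards = [("Tesla P100", 16), ("RTX 3090", 24)]

for card, vram_gb in cards:
    for bits in (16, 8, 4):
        need = weight_gb(30, bits)
        verdict = "fits" if need < vram_gb else "does not fit"
        print(f"30B @ {bits}-bit: ~{need:.0f} GB of weights, {verdict} in {card} ({vram_gb} GB)")
```

At 4 bits a 30B model needs about 15 GB for weights alone, which leaves almost no headroom on a 16 GB card and is why 24 GB is the more comfortable target.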
Yes. I recently benchmarked the 70B Llama 2 model on a 24 vCPU vSphere host with 64GB RAM (through Ollama) and it was capable of spitting out ~0.15 tokens / second. Useless for any interactive use-case but better than nothing. As a comparison the 7B Llama 2 model was ~1.5 tokens / second on the same hardware while the cheapest M1 MacBook Air can do ~10 tokens / second thanks to GPU acceleration.
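To put those rates in wall-clock terms, here is a quick calculation of how long one reply would take at each speed. The 200-token reply length is an arbitrary assumption; the rates are the ones quoted above.

```python
# Tokens per second, as quoted in the comment above.
rates = {
    "70B on 24 vCPU host": 0.15,
    "7B on 24 vCPU host": 1.5,
    "M1 MacBook Air": 10.0,
}

REPLY_TOKENS = 200  # assumed length of a single reply


def seconds_for(tokens, tokens_per_second):
    """Time to generate `tokens` at a steady generation rate."""
    return tokens / tokens_per_second


for label, rate in rates.items():
    secs = seconds_for(REPLY_TOKENS, rate)
    print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min)")
```

At 0.15 tokens / second, a 200-token reply takes over 20 minutes, versus 20 seconds at 10 tokens / second.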