by viraptor 1032 days ago
Does that mean that now anyone with an RTX 20 series card or above can run local ML models as big as their RAM allows? (Or larger if they're happy to wait for swapping to SSD) Or am I misunderstanding the scale of the impact here?

(Not exactly "now", but when the software is recompiled / ported to this)

3 comments

You could already do that with Unified Memory, which has existed for a while and IIRC supported paging and swapping, assuming you `cudaMallocManaged` and `cudaFree` appropriately for your allocations.
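
For reference, a minimal sketch of that older Unified Memory style, where the buffer has to come from `cudaMallocManaged` so that both CPU and GPU can touch it. The helper name `managed_scale_sum` and the kernel `scale` are made up for illustration; only the `cuda*` runtime calls are real API.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Multiply every element of data by factor.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: fills a managed buffer with 1.0f on the CPU, scales it
// on the GPU, then sums it on the CPU. Returns -1.0f on any CUDA error.
float managed_scale_sum(size_t n, float factor) {
    if (n == 0) return 0.0f;
    float *data = nullptr;
    // The allocation must go through the CUDA allocator in this model.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1.0f;

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes the pages

    unsigned threads = 256;
    unsigned blocks = (unsigned)((n + threads - 1) / threads);
    scale<<<blocks, threads>>>(data, n, factor);     // GPU touches the same pointer
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1.0f;
    }

    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += data[i];   // CPU reads the results back
    cudaFree(data);
    return sum;
}
```

The point is that the same pointer is valid on both sides, but only because the CUDA runtime handed it out.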

This is not a change to "features" but a change to the programming model. You now never need to write cudaMalloc or cudaFree; you can just use any allocator or tool. This means more off-the-shelf code will just work when used with CUDA. So now your io_uring buffers can be shared with the GPU trivially, for example, or mmap'd pages that a library gave you, or whatever.
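
A rough sketch of what "any allocator" looks like in practice: a plain `std::vector` handed straight to a kernel. This assumes a system where the device reports `cudaDevAttrPageableMemoryAccess` (HMM or hardware coherence); on anything else the sketch skips the launch instead of faulting. The names `increment`, `pageable_access_supported` and `increment_vector_on_gpu` are hypothetical.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Add 1 to every element.
__global__ void increment(int *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// True if the device can access ordinary pageable host memory directly.
bool pageable_access_supported(int device) {
    int value = 0;
    if (cudaDeviceGetAttribute(&value, cudaDevAttrPageableMemoryAccess, device)
            != cudaSuccess) {
        return false;
    }
    return value != 0;
}

// Returns 1 if the kernel ran on the vector's own storage, 0 if the system
// does not support pageable access (nothing is launched), -1 on a CUDA error.
int increment_vector_on_gpu(std::vector<int> &v) {
    if (!pageable_access_supported(0)) return 0;
    if (v.empty()) return 1;

    unsigned threads = 256;
    unsigned blocks = (unsigned)((v.size() + threads - 1) / threads);
    // No cudaMalloc, no cudaMemcpy: the pointer comes from the normal heap.
    increment<<<blocks, threads>>>(v.data(), v.size());
    if (cudaGetLastError() != cudaSuccess) return -1;
    return cudaDeviceSynchronize() == cudaSuccess ? 1 : -1;
}
```

The attribute check is the part worth keeping in real code, since the same binary may land on a machine without HMM.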

The programming model is one of the things Nvidia does significantly better than any competitor. Single source model + HMM is a big step up from something like OpenCL in productivity and correctness.

On Grace Hopper chips, HMM is granular down to the cache line (64 bytes); on x86 systems I believe they said it's (of course) a 4k page granularity.

mmap'ing weights directly from a file seems to be new (I think). I need to check my notes to remember whether you could already do that with some cuda* API.
Yeah, I think a good simple litmus test for this is "can I directly call mmap(2) on a file, and then launch a kernel on that mmap'd memory, with no extra steps, and it works as I expect it to". With these newer features in CUDA, the answer to that is "yes you can."
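
That litmus test can be sketched roughly like this: `mmap(2)` a file read-only and let a kernel sum its bytes through the mapped pointer. It assumes Linux and a device that reports `cudaDevAttrPageableMemoryAccess`; without that the function returns -1 rather than launching. `byte_sum` and `gpu_sum_file` are hypothetical names, and the small result slot still uses `cudaMallocManaged` just to keep the sketch short.

```cuda
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Grid-stride sum of all bytes, accumulated into *out.
__global__ void byte_sum(const unsigned char *bytes, size_t n,
                         unsigned long long *out) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    unsigned long long local = 0;
    for (; i < n; i += stride) local += bytes[i];
    atomicAdd(out, local);
}

// Returns the sum of the file's bytes computed on the GPU straight from the
// mmap'd pages, or -1 if the system lacks pageable access or anything fails.
long long gpu_sum_file(const char *path) {
    int supported = 0;
    if (cudaDeviceGetAttribute(&supported, cudaDevAttrPageableMemoryAccess, 0)
            != cudaSuccess || !supported) return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }
    size_t n = (size_t)st.st_size;

    void *map = mmap(nullptr, n, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          // the mapping stays valid after close
    if (map == MAP_FAILED) return -1;

    long long result = -1;
    unsigned long long *out = nullptr;
    if (cudaMallocManaged(&out, sizeof(*out)) == cudaSuccess) {
        *out = 0;
        // The kernel reads the file-backed pages with no explicit copy.
        byte_sum<<<64, 256>>>((const unsigned char *)map, n, out);
        if (cudaGetLastError() == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess) {
            result = (long long)*out;
        }
        cudaFree(out);
    }
    munmap(map, n);
    return result;
}
```
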
You can already do that with GGUF/GGML models which allow you to split between CPU and GPU. Obviously there is a performance hit when running on your DDR5 and CPU compared to HBM/GDDR and GPU but it’s better than nothing.
I have not been keeping up with developments. Does this mean mortals can run the biggest tier of Llama models (albeit with trash performance) by using system ram? For playing around, I would be willing to let my system chug along just to see what the top tier models can achieve.
Technically yes - if you have lots of RAM you can use that and your CPU. As you say, the performance would be pretty poor, though, especially as it's a tool where you want to tweak your responses quite frequently. I've been running an old Nvidia Tesla P100 card that I got cheap on eBay for a while now. It has 16 GB of VRAM, but it is pretty old. I'm so interested in this now that I've gone out and got myself a secondhand RTX 3090 - something I never thought I'd do, but I'd really like to run 30B models on the GPU.
Yes. I recently benchmarked the 70B Llama 2 model on a 24 vCPU vSphere host with 64GB RAM (through Ollama) and it was capable of spitting out ~0.15 tokens / second. Useless for any interactive use-case but better than nothing. As a comparison the 7B Llama 2 model was ~1.5 tokens / second on the same hardware while the cheapest M1 MacBook Air can do ~10 tokens / second thanks to GPU acceleration.
Already doable. The gotcha is that it is slow AF. Even with a 90%/10% split the subjective experience tanks hard, so it usually makes sense to pick something that fits into your VRAM.