by orost 1041 days ago
Anything with 64GB of memory will run a quantized 70B model. What else you need depends on what speed is acceptable to you. With a decent CPU but no GPU assistance, expect output on the order of 1 token per second and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation you need 48GB of VRAM to fit the entire model. That means 2x RTX 3090 or better, which should generate faster than you can read.

Edit: the above is about PC. Macs are much faster at CPU generation, but not nearly as fast as big GPUs, and their ingestion is still slow.
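
The memory figures in the comment above can be sanity-checked with some back-of-envelope arithmetic. This is only a rough sketch: the function names are made up for illustration, and the 4 GB allowance for context cache and runtime buffers is an assumed placeholder, not a measured number.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, budget_gb: float,
         overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed overhead fit in the memory budget.

    overhead_gb is a hypothetical allowance for context cache and
    runtime buffers; the real figure depends on context length and runtime.
    """
    return weights_gb(n_params, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    params = 70e9  # 70B parameters
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weights_gb(params, bits):.0f} GB")
    print("4-bit in 48 GB VRAM (2x 24 GB):", fits(params, 4, 48))
    print("4-bit in a single 24 GB card:", fits(params, 4, 24))
    print("16-bit in 64 GB RAM:", fits(params, 16, 64))
```

At 4 bits per weight the weights alone come to about 35 GB, which is why a quantized 70B model fits in 64GB of RAM or across two 24GB cards, but not on a single 24GB card.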

2 comments

Vast.ai would rent you those for about $0.50 an hour, which gives you an idea of what it costs. That's assuming the GPUs' memory can be stacked.
Do these large models need the equivalent of SLI to take advantage of multiple GPUs? Nvidia removed SLI from consumer cards a few years ago, so I'm curious whether it's even an option these days.
SLI isn't used at all for CUDA. If you meant NVLink, it's apparently not useful at small scales. I think the PCIe lanes are enough.
This is wrong. NVLink is crucial for tensor parallelism, both for training models and for inference in large (>40B param) models.