| HN Mirror


	by orost 1041 days ago
	Anything with 64GB of memory will run a quantized 70B model. What else you need depends on what speed is acceptable to you. With a decent CPU but no GPU assistance, expect output on the order of 1 token per second, and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation you need 48GB of VRAM to fit the entire model. That means 2x RTX 3090 or better, which should generate faster than you can read. Edit: the above is about PCs. Macs are much faster at CPU generation, but not nearly as fast as big GPUs, and their ingestion is still slow.
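	For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope sketch, not a benchmark. It assumes exactly 4 bits per weight (real quantized formats usually use a bit more, and the KV cache comes on top), and it assumes generation is limited by memory bandwidth, so every generated token requires reading all the weights once. The bandwidth figures are assumptions too: about 50 GB/s as a round number for dual-channel desktop RAM, and about 936 GB/s as the published figure for one RTX 3090. The function names are made up for this example.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic simple


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model

    # Weight footprint at a few precisions.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_bytes(params, bits) / GB:.0f} GB")

    # Bandwidth-bound ceiling on tokens/s for a 4-bit model.
    model = weight_bytes(params, 4)
    assumed_bandwidth = {
        "dual-channel desktop RAM (assumed ~50 GB/s)": 50 * GB,
        "RTX 3090 VRAM (assumed ~936 GB/s)": 936 * GB,
    }
    for name, bw in assumed_bandwidth.items():
        print(f"{name}: at most {max_tokens_per_second(model, bw):.1f} tokens/s")
```

	Under these assumptions the 4-bit weights come to 35 GB, which is why 64GB of system RAM or 48GB of VRAM leaves headroom, and the CPU ceiling lands near the "1 token per second" mentioned above. When the layers are split across two GPUs and run one after the other, each token still reads all the weights once, so the single-GPU bandwidth is the more relevant ceiling rather than the sum of both.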

2 comments

quickthrower2 1041 days ago

Vast.ai would rent you those for about $0.50 an hour, so that gives you an idea of what it costs, assuming the GPUs' memory can be pooled.

tstrimple 1041 days ago

Do these large models need the equivalent of SLI to take advantage of multiple GPUs? Nvidia removed SLI from consumer cards a few years ago, so I'm curious whether it's even an option these days.

sterlind 1041 days ago

SLI isn't used at all for CUDA. If you meant NVLink, it's apparently not useful at small scales. I think the PCIe lanes are enough.

ipsum2 1033 days ago

This is wrong. NVLink is crucial for tensor parallelism, both when training models and when running inference on large (>40B param) models.
