|
|
|
|
|
by orost
1041 days ago
|
|
Anything with 64GB of memory will run a quantized 70B model. What else you need depends on what is acceptable speed for you. With a decent CPU but without any GPU assistance, expect output on the order of 1 token per second, and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation, you need 48GB VRAM to fit the entire model. That means 2x RTX 3090 or better. That should generate faster than you can read. Edit: the above is about PC. Macs are much faster at CPU generation, but not nearly as fast as big GPUs, and their ingestion is still slow. |
|