Hacker News
by cpill 730 days ago
So how much GPU RAM does it need to get the 70B going fast(ish)?
1 comment

A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. At 8 bits this is convenient for the math: one byte per weight, so 70B parameters take about 70GB, plus some overhead for the attention state (the KV cache) of ongoing requests. This overhead depends on workload and context lengths, but you should expect about 30% more, which comes to roughly 91GB. So, budget around 100GB for a server under load.
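
If you want to plug in other model sizes or bit widths, the arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: the function name and parameters are made up for illustration, and the 30% overhead is the rough figure from this comment, not a measured value.

```python
def estimate_gpu_memory_gb(params_billions: float,
                           bits_per_weight: float,
                           overhead_fraction: float = 0.30) -> float:
    """Rough serving-memory estimate in GB (hypothetical helper).

    params_billions:   model size in billions of parameters (70 for a 70B model)
    bits_per_weight:   quantization width, e.g. 6 to 8 bits
    overhead_fraction: extra memory for in-flight requests; 0.30 is the
                       rule-of-thumb assumption from the comment above
    """
    # 1e9 parameters * (bits / 8) bytes each = that many GB (1 GB = 1e9 bytes)
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)


if __name__ == "__main__":
    for bits in (6, 8):
        total = estimate_gpu_memory_gb(70, bits)
        print(f"70B at {bits} bits/weight: ~{total:.0f} GB")
```

For a 70B model this gives about 91GB at 8 bits and about 68GB at 6 bits, so rounding up to around 100GB leaves some headroom for longer contexts or more concurrent requests.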