
A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. This is convenient for the math: at 8 bits each weight takes one byte, so the weights come to 70GB, plus some overhead for the attention matrices (the KV cache for ongoing requests). This overhead depends on workload and context lengths, but you should expect about 30% more. That works out to roughly 91GB, so call it around 100GB for a server under load.
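As a rough sketch of that arithmetic in Python: the function name and parameters here are hypothetical, the 70B parameter count is an assumption inferred from the 70GB figure, and 1GB is taken as 10^9 bytes. The 30% overhead is just the ballpark from above, not a measured value.

```python
def estimate_serving_memory_gb(params_billions, bits_per_weight=8, overhead_fraction=0.30):
    """Back-of-the-envelope memory estimate for serving a quantized model.

    params_billions:   parameter count in billions (assumed, e.g. 70)
    bits_per_weight:   quantization level (the rule of thumb says 6 to 8)
    overhead_fraction: extra memory for attention state under load (~30% guess)
    """
    # One billion parameters at 8 bits each is 1 GB (10^9 bytes).
    weights_gb = params_billions * bits_per_weight / 8
    # Overhead is modeled as a flat fraction of the weight size.
    total_gb = weights_gb * (1 + overhead_fraction)
    return weights_gb, total_gb


weights_gb, total_gb = estimate_serving_memory_gb(70)
print(f"{weights_gb:.0f} GB weights, {total_gb:.0f} GB total")
# prints "70 GB weights, 91 GB total"
```

Dropping to 6 bits per weight with the same assumptions gives about 52.5GB of weights and about 68GB total, which is why the quantization level matters so much for what hardware you need.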