by moyix
1280 days ago
One float per param, so naively 175B params * 4 bytes = ~700GB on disk. Most recent models are trained in FP16 or BF16, so 350GB. There's also some work on quantizing them to INT8, which knocks that down to a mere 175GB. You can definitely run it on a desktop computer using RAM and NVMe offload to make up for the fact that you probably don't have 175GB of GPU memory available, but it won't be fast: https://huggingface.co/blog/bloom-inference-pytorch-scripts

OpenAI generates responses so fast by doing the generation in parallel across something like 8x80GB A100s (I don't know the exact details of their hardware setup, but NVIDIA's open FasterTransformer library achieves low latency for large models this way).
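If you want to redo the arithmetic above for other model sizes, here's a rough sketch. It counts weights only (no activations, KV cache, or framework overhead, so real usage will be higher), and the helper name `weights_size_gb` and the 8x80GB figure are just illustrative assumptions taken from the guess above, not anyone's actual setup.

```python
# Back-of-the-envelope weight storage for a dense model.
# Assumption: one stored value per parameter, nothing else counted.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weights_size_gb(n_params: int, dtype: str) -> float:
    """Size of the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 175_000_000_000  # 175B parameters
TOTAL_GPU_GB = 8 * 80       # hypothetical 8x80GB A100 node

for dtype in BYTES_PER_PARAM:
    size = weights_size_gb(N_PARAMS, dtype)
    fits = "fits" if size <= TOTAL_GPU_GB else "does not fit"
    print(f"{dtype}: {size:.0f} GB of weights, {fits} in {TOTAL_GPU_GB} GB")
```

Under those assumptions, FP32 weights overflow the 640GB node while FP16/BF16 and INT8 fit with room to spare, which is consistent with serving in half precision across eight GPUs.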