by microtonal
702 days ago
You can run the 4-bit GPTQ/AWQ-quantized Llama 405B somewhat reasonably on 4x H100 or A100. You will be somewhat limited in how many tokens you can have in flight across requests, and you cannot create CUDA graphs for larger batch sizes.

You can run 405B well on 8x H100 or A100, either with the mixed BFloat16/FP8 checkpoint that Meta provided or with GPTQ/AWQ-quantized models. Note though that the A100 does not have native support for FP8, but FP8-quantized weights can still be used there through the GPTQ-Marlin FP8 kernel.

Here are some TGI 405B benchmarks that I did with the different quantized models: https://x.com/danieldekok/status/1815814357298577718

The 405B model is also very useful beyond direct use in inference, e.g. for generating synthetic data for training smaller models: https://huggingface.co/blog/synthetic-data-save-costs
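To make the 4x vs. 8x distinction concrete, here is a rough back-of-the-envelope sketch of the weight memory involved. It assumes 80 GB cards (there is also a 40 GB A100, which would change the picture) and ignores quantization metadata such as group scales and zero points, so real footprints are somewhat larger. The function names are just illustrative, not part of TGI or any other library.

```python
# Rough weight-memory estimate for a 405B-parameter model.
# Assumptions (not from the comment above): 80 GB per GPU, decimal GB,
# and no overhead for quantization scales/zero points or runtime buffers.
PARAMS = 405e9
GPU_MEMORY_GB = 80


def weight_memory_gb(bits_per_weight, params=PARAMS):
    """Memory needed for the weights alone, in GB."""
    return params * bits_per_weight / 8 / 1e9


def headroom_gb(bits_per_weight, num_gpus, gpu_memory_gb=GPU_MEMORY_GB):
    """Memory left over for KV cache, activations and CUDA graphs, in GB."""
    return num_gpus * gpu_memory_gb - weight_memory_gb(bits_per_weight)


if __name__ == "__main__":
    configs = [
        ("4-bit GPTQ/AWQ on 4 GPUs", 4, 4),
        ("4-bit GPTQ/AWQ on 8 GPUs", 4, 8),
        ("FP8 on 8 GPUs", 8, 8),
        ("BF16 on 8 GPUs", 16, 8),
    ]
    for label, bits, gpus in configs:
        print(
            f"{label}: weights {weight_memory_gb(bits):.1f} GB, "
            f"headroom {headroom_gb(bits, gpus):.1f} GB"
        )
```

Under these assumptions the 4-bit weights take about 202.5 GB, which leaves roughly 117.5 GB across 4 GPUs for the KV cache and CUDA graphs. That is why the number of in-flight tokens gets tight there, while 8 GPUs leave plenty of room. Pure BF16 weights (810 GB) would not fit on 8x 80 GB at all, which is consistent with the choice of FP8 or 4-bit checkpoints.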