by diggan
314 days ago
> and GPT-OSS-120B. The latter being the only one capable of fitting on a 64 to 96GB VRAM machine with quantization.

Tiny correction: Even without quantization, you can run GPT-OSS-120B (with full context) on ~60GB VRAM :)

> Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the gpt-oss-20b model run within 16GB of memory.
Even if you _could_ fit it within ~60GB VRAM, the amount of VRAM required varies with context length and prompt size, so you would OOM pretty quickly.
edit: Ah, and MXFP4 is itself a quantization format, just supposedly closer to the original FP16 than the others, with a smaller VRAM requirement.
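
For anyone who wants to sanity-check the numbers in this thread, here is a rough back-of-envelope sketch. It assumes a nominal 120e9 parameters (taken from the model name), treats every weight as MXFP4 (the quote above says only the MoE layers are, so this slightly understates the total), and uses 4.25 bits per parameter for MXFP4 (4-bit elements plus one shared 8-bit scale per block of 32). The layer count, KV head count, and head size are hypothetical placeholders, not values read from the model config, and the KV estimate ignores any sliding-window attention.

```python
# Back-of-envelope VRAM estimate: weights plus KV cache.
# All model shapes below are illustrative assumptions, not official figures.

GIB = 2**30

# MXFP4: 4-bit elements plus one shared 8-bit scale per 32-element block.
MXFP4_BITS = 4 + 8 / 32  # 4.25 bits per parameter


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors for a single sequence with full attention, in GiB."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB


if __name__ == "__main__":
    n_params = 120e9  # assumption: nominal size from the model name

    print(f"FP16 weights:  {weight_gib(n_params, 16):.1f} GiB")
    print(f"MXFP4 weights: {weight_gib(n_params, MXFP4_BITS):.1f} GiB")

    # Hypothetical attention shapes, used only to show how the cache scales.
    for ctx in (8_192, 32_768, 131_072):
        kv = kv_cache_gib(n_layers=36, n_kv_heads=8, head_dim=64, context_len=ctx)
        print(f"KV cache at {ctx:>7} tokens: {kv:.2f} GiB")
```

The weights are a fixed cost, while the KV cache grows linearly with context length and with the number of concurrent sequences, which is where the headroom on a ~60GB budget goes. Activations and runtime overhead are left out entirely, so treat the output as a lower bound.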