HN Mirror


by prophesi 313 days ago

Hm I don't think so. You might be thinking about the file size, which is ~64GB.

> Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the gpt-oss-20b model run within 16GB of memory.

Even if you _could_ fit it within ~60GB of VRAM, the extra VRAM needed for longer context lengths and larger prompts would make it OOM pretty quickly.

edit: Ah, and MXFP4 is itself a quantization, just supposedly closer to the original FP16 than the others, with a smaller VRAM requirement.
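To make the context-length point concrete, here is a rough back-of-the-envelope sketch of how KV-cache memory grows on top of the weights. The layer count, KV-head count, and head dimension below are illustrative placeholders, not confirmed gpt-oss-120b values, and the formula assumes plain full attention with an FP16 cache (no sliding window, no cache quantization), so treat the result as an upper-bound style estimate only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an FP16/BF16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimated_vram_gb(weights_gb, n_layers, n_kv_heads, head_dim, context_len):
    """Weights plus KV cache, in decimal GB (1 GB = 1e9 bytes)."""
    cache_gb = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len) / 1e9
    return weights_gb + cache_gb


if __name__ == "__main__":
    # Hypothetical architecture numbers, chosen only for illustration.
    arch = dict(n_layers=36, n_kv_heads=8, head_dim=64)
    # ~64 GB is the file size mentioned above, used as a stand-in for weights.
    for ctx in (8_192, 32_768, 131_072):
        total = estimated_vram_gb(64.0, context_len=ctx, **arch)
        print(f"context={ctx:>7}: ~{total:.1f} GB")
```

Under these assumptions the cache grows linearly with context length, which is why a setup that barely fits at a short context can run out of memory at a long one. Real runtimes also need compute buffers and activations, which this sketch ignores.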

1 comment

diggan 313 days ago

> Hm I don't think so. You might be thinking about the file size, which is ~64GB.

No, the numbers I put above are literally the VRAM usage I see when I load 120B with llama.cpp. They're real-life numbers, not theoretical ones :)
