by adrian_b
15 days ago
The model is expected to be published today on Huggingface.co, where there should be more information. For now, this is what NVIDIA says:

Nemotron 3 Ultra is NVIDIA's largest open model: 550B total parameters with up to 55B active per token via a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture.
Similar to Nemotron 3 Super, it was pre-trained using NVFP4 and shares the same core technical innovations:
LatentMoE — Compresses tokens into a low-rank latent space before routing, enabling 4× as many expert specialists for the same inference cost.
Multi-Token Prediction (MTP) — Predicts multiple future tokens in a single forward pass, improving chain-of-thought coherence and enabling built-in speculative decoding at inference time.
1M Token Context Length — Mamba-2 layers provide linear-time complexity over sequence length, making 1M-token context practical for long-document and agentic workloads.
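
To make the "built-in speculative decoding" point a bit more concrete, here is a minimal sketch of the generic greedy draft-and-verify acceptance rule. It is not NVIDIA's implementation: the function name `accept_draft` and the toy token lists are hypothetical, and it assumes the MTP heads propose the draft tokens while one pass of the full model supplies its own greedy choice at each drafted position plus one extra.

```python
def accept_draft(draft, target):
    """Greedy speculative-decoding acceptance (generic sketch, names are hypothetical).

    draft  : tokens proposed cheaply (e.g. by multi-token-prediction heads).
    target : the full model's greedy token at each draft position, plus one
             extra token for the position after the last draft token.
    Returns the tokens actually emitted in this step.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")

    emitted = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            # First disagreement: keep the full model's token and stop.
            emitted.append(target[i])
            return emitted
        emitted.append(tok)

    # Every draft token was accepted, so the extra verified token comes for free.
    emitted.append(target[-1])
    return emitted


if __name__ == "__main__":
    print(accept_draft([1, 2, 3], [1, 2, 4, 9]))
    print(accept_draft([1, 2, 3], [1, 2, 3, 7]))
```

Under this rule the output is identical to plain greedy decoding with the full model. The gain is that each verification pass can emit several tokens when the draft is good, and never fewer than one.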