by slacka
303 days ago
Very interesting model. Some key points from the blog:

* NVIDIA is also releasing most of the data they used to create it, including the pretraining corpus.
* The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL.

At this size and with only 4 attention layers, it should run very fast locally on cheap 12GB GPUs.
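To put a rough number on why so few attention layers matters for local inference, here is a back-of-the-envelope sketch of KV-cache size. Only attention layers keep a per-token key/value cache, so the cache scales with the attention layer count. The head count, head dimension, context length, and the 32-layer comparison model are made-up illustrative values, not figures from the Nemotron-H report, and the fixed-size Mamba-2 state is ignored.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Approximate KV-cache size in bytes for a decoder.

    The leading factor of 2 counts one K and one V tensor per attention layer.
    bytes_per_elem=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical shapes, chosen only for illustration.
SEQ_LEN = 131072   # assumed context length
KV_HEADS = 8       # assumed number of KV heads
HEAD_DIM = 128     # assumed head dimension

hybrid = kv_cache_bytes(attn_layers=4, kv_heads=KV_HEADS,
                        head_dim=HEAD_DIM, seq_len=SEQ_LEN)
dense = kv_cache_bytes(attn_layers=32, kv_heads=KV_HEADS,
                       head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"4 attention layers:  {hybrid / 2**30:.1f} GiB")
print(f"32 attention layers: {dense / 2**30:.1f} GiB")
```

Under these assumed shapes the cache shrinks in direct proportion to the number of attention layers, which leaves more of a 12GB card for the weights themselves.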