by drakenot
507 days ago
(Summary from Reddit)
- FP8 instead of FP32 precision for training = 75% less memory (8 bits per value instead of 32)
- Multi-token prediction to vastly speed up token output
- Mixture of Experts (MoE), so that inference only uses part of the model rather than the entire model (~37B parameters active at a time, not the entire 671B), which increases efficiency
- PTX (basically low-level assembly code) hacking to pump as much performance as possible out of their older Nvidia H800 GPUs

Then, the big innovation of R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training.
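For anyone who wants to sanity-check the numbers in that summary, here is a rough back-of-the-envelope sketch. It only counts raw weight storage and assumes every value is held at the stated bit width, which is a simplification (real mixed-precision training keeps some tensors in higher precision, plus optimizer state and activations). The function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic for the FP8 and MoE points above.
# Assumption: weights only; optimizer state, activations, and any tensors
# kept in higher precision are ignored.

TOTAL_PARAMS = 671e9   # total parameter count quoted in the summary
ACTIVE_PARAMS = 37e9   # parameters active at a time, as quoted


def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param / 8


def memory_saving(bits_from, bits_to):
    """Fractional memory saved per value when narrowing the bit width."""
    return 1 - bits_to / bits_from


def active_fraction(active, total):
    """Share of the model's parameters that are active at a time."""
    return active / total


if __name__ == "__main__":
    fp32_tb = weight_bytes(TOTAL_PARAMS, 32) / 1e12
    fp8_tb = weight_bytes(TOTAL_PARAMS, 8) / 1e12
    print(f"FP32 weights: {fp32_tb:.3f} TB, FP8 weights: {fp8_tb:.3f} TB")
    print(f"Saving from FP32 to FP8: {memory_saving(32, 8):.0%}")
    print(f"Active share of the MoE: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

So the "75% less memory" figure is just 8 bits versus 32 bits per value, and ~37B out of 671B means only around 5.5% of the parameters are active at a time.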