by drakenot 507 days ago
(Summary from Reddit)

- FP8 instead of FP32 training precision: 1 byte per value instead of 4, so roughly 75% less memory for whatever is stored in FP8

- multi-token prediction to vastly speed up token output

- Mixture of Experts (MoE), so that inference uses only part of the model rather than all of it (~37B parameters active at a time out of 671B total), which increases efficiency

- PTX (basically Nvidia's low-level, assembly-like code) hacking to squeeze as much performance as possible out of their older H800 GPUs
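
To put rough numbers on the precision and MoE bullets, here is a back-of-the-envelope sketch in Python. It assumes 1 byte per FP8 value and 4 bytes per FP32 value and counts only the raw weights; optimizer state, activations, and anything kept in higher precision are ignored, so real-world savings would differ. The function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic for the FP8 and MoE bullets above.
# Assumption: only raw weight storage is counted, nothing else.

BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e9

def savings(from_dtype: str, to_dtype: str) -> float:
    """Fractional memory reduction when switching storage dtypes."""
    return 1 - BYTES_PER_VALUE[to_dtype] / BYTES_PER_VALUE[from_dtype]

TOTAL_PARAMS = 671e9   # total parameter count quoted in the post
ACTIVE_PARAMS = 37e9   # parameters active at a time, as quoted in the post

print(f"fp32 weights: {weight_memory_gb(TOTAL_PARAMS, 'fp32'):.0f} GB")
print(f"fp8 weights:  {weight_memory_gb(TOTAL_PARAMS, 'fp8'):.0f} GB")
print(f"fp32 -> fp8 saving: {savings('fp32', 'fp8'):.0%}")
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The 75% figure falls straight out of the 4:1 byte ratio, and the MoE numbers mean only about one parameter in eighteen is touched at a time.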

Then, the big innovation of R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training.

1 comment

They also use some kind of factorized attention that somehow leads to compression of tokens (I still haven't read their papers, so I can't be clearer than this).
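
For what it's worth, one common way a factorized attention can lead to compression is a low-rank bottleneck: cache a small latent vector per token and rebuild keys and values from it when needed. The sketch below is a generic toy version of that idea in plain Python, not the scheme from the DeepSeek papers; the sizes, matrix names, and helper functions are all hypothetical.

```python
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Random weights standing in for learned projections."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

D_MODEL = 16   # hypothetical hidden size
D_LATENT = 4   # hypothetical compressed size, much smaller than D_MODEL

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)  # hidden -> latent
W_up_k = random_matrix(D_MODEL, D_LATENT, rng)  # latent -> key
W_up_v = random_matrix(D_MODEL, D_LATENT, rng)  # latent -> value

cache = []  # one small latent vector per token, instead of a full key and value

def append_token(hidden):
    """Compress the token's hidden state and store only the latent."""
    cache.append(matvec(W_down, hidden))

def keys_and_values():
    """Rebuild full-size keys and values from the cached latents."""
    keys = [matvec(W_up_k, c) for c in cache]
    values = [matvec(W_up_v, c) for c in cache]
    return keys, values

for _ in range(5):
    append_token([rng.gauss(0, 1) for _ in range(D_MODEL)])

keys, values = keys_and_values()
cached_floats = sum(len(c) for c in cache)   # what this sketch stores
full_floats = 2 * len(cache) * D_MODEL       # what a plain key/value cache would store
print(cached_floats, full_floats)
```

With these toy sizes the cache holds 4 numbers per token instead of 32, at the cost of extra matrix multiplies to rebuild keys and values.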