DeepSeek-R1 had an RLHF step in its post-training pipeline (section 2.3.4 of the technical report[1]).
In addition, the "reasoning-oriented reinforcement learning" step (section 2.3.2) used an approach that is almost identical to RLHF in theory and implementation. The main difference is that they used a rule-based reward system, rather than a model trained on human preference data.
If you want to train a model like DeepSeek-R1, you'll need to know the fundamentals of reinforcement learning on language models, including RLHF.
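To make the "rule-based reward" point concrete, here is a minimal sketch of what such a reward might look like in place of a learned preference model. It assumes completions wrap their reasoning in <think> tags and put the final answer in \boxed{}; the function names, the regexes, and the 0.5 format weight are my own illustrative choices, not anything taken from the DeepSeek report.

```python
import re

# Hypothetical conventions for this sketch: reasoning goes inside <think>...</think>,
# followed by some answer text, and the final answer is written as \boxed{...}.
THINK_RE = re.compile(r"^<think>.+?</think>\s*(?P<answer>.+)$", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed <think> layout, else 0.0."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last \\boxed{...} answer exactly matches the reference."""
    boxed = BOXED_RE.findall(completion)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == reference.strip() else 0.0


def rule_based_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Combine both checks into one scalar; no learned reward model is involved."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)
```

The scalar returned by rule_based_reward would be fed to the policy-gradient update in the same place an RLHF pipeline would use the reward model's score, which is why the rest of the machinery carries over.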
As the other commenter said, R1 required very standard RLHF techniques too.
But a fun way to think about it is that reasoning models are going to be the bigger wave, and they will lift the RLHF boat along with them.
But we need a few years to establish basics before I can write a cumulative RL for LLMs book ;)
[1] https://arxiv.org/pdf/2501.12948