| HN Mirror



	by kadushka 509 days ago
	Has r1 made RLHF obsolete?

4 comments

alexhutcheson 509 days ago

DeepSeek-R1 had an RLHF step in their post-training pipeline (section 2.3.4 of their technical report[1]).

In addition, the "reasoning-oriented reinforcement learning" step (section 2.3.2) used an approach that is almost identical to RLHF in theory and implementation. The main difference is that they used a rule-based reward system, rather than a model trained on human preference data.
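To make the distinction concrete, here is a minimal sketch of what a rule-based reward could look like. It is not DeepSeek's actual implementation: the `<think>`/`<answer>` tag layout, the exact-match answer check, the equal weighting of the two terms, and all function names are assumptions chosen for illustration. The point is only that the reward comes from fixed rules rather than from a model trained on human preference data.

```python
import re

# Hypothetical output layout: reasoning inside <think>, final answer inside <answer>.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # no parseable answer, so no accuracy credit
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def rule_based_reward(completion: str, reference: str) -> float:
    """Sum the two rule-based terms (equal weighting is an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)


if __name__ == "__main__":
    sample = "<think>2 + 2 is 4</think> <answer>4</answer>"
    print(rule_based_reward(sample, "4"))
```

In an RLHF setup, `rule_based_reward` would be replaced by a call to a learned reward model; the surrounding policy-optimization loop can stay largely the same, which is why the two approaches look so similar in practice.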

If you want to train a model like DeepSeek-R1, you'll need to know the fundamentals of reinforcement learning on language models, including RLHF.

[1] https://arxiv.org/pdf/2501.12948


bryan0 509 days ago

Yes, but these steps were not used in R1-Zero, which is where its reasoning capabilities were trained.


littlestymaar 509 days ago

And as a result, R1-Zero is way too crude to be used directly, which is a good indication that RLHF remains relevant.


natolambert 509 days ago

As the other commenter said, R1 required very standard RLHF techniques too. But a fun way to think about it is that reasoning models are going to be bigger and uplift the RLHF boat.

But we need a few years to establish basics before I can write a cumulative RL for LLMs book ;)


JackYoustra 504 days ago

This is a GREAT book. If you decide to write it in a rolling fashion, you'd have at least one reader from the start :)


gr3ml1n 509 days ago

This feels like a category mistake. Why would R1 make RLHF obsolete?


drmindle12358 509 days ago

Did you mean to ask "Has r1 made SFT obsolete?"
