| HN Mirror



	by krackers 550 days ago
	The R1 paper says that they didn't use "process reward modeling". And the paper that introduced GRPO says that it can be used with either "outcome supervision" or "process supervision", with outcome supervision "only provid[ing] a reward at the end of each output". Put together, doesn't that imply R1 uses sparse rewards provided only at the end of the CoT sequence?
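
	For concreteness, here is a rough sketch of what outcome supervision looks like in GRPO as I read that paper: each sampled output gets one scalar reward, the rewards are normalized within the group, and every token of an output shares that normalized value as its advantage. The function name, the `eps` term, and the choice of population standard deviation are my own assumptions, not something the paper or the R1 code specifies.

```python
from statistics import mean, pstdev


def outcome_advantages(rewards, lengths, eps=1e-8):
    """Sketch of GRPO-style outcome supervision (names are hypothetical).

    rewards: one scalar reward per sampled output in the group.
    lengths: number of tokens in each output.
    Returns one list of per-token advantages per output.
    """
    mu = mean(rewards)
    # Assumption: population std; eps avoids division by zero
    # when every output in the group gets the same reward.
    sigma = pstdev(rewards)
    normalized = [(r - mu) / (sigma + eps) for r in rewards]
    # The single end-of-output reward is broadcast to all tokens.
    return [[a] * n for a, n in zip(normalized, lengths)]


if __name__ == "__main__":
    group = outcome_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
    for adv in group:
        print(adv)
```

	So the reward itself arrives only once per output, even though every token ends up with a training signal.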

1 comments

serialx 550 days ago

Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1. Your "sparse reward" means only providing reward at the end of each output.
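
To make the two meanings concrete, here is a small sketch. The helper names and the example numbers are made up for illustration: the first helper shows "sparse in time" (reward only on the last token of an output), the second measures "sparse in value" (how often the reward is nonzero across outputs).

```python
def per_token_rewards(final_reward, num_tokens):
    """Sparse in time: zero everywhere except the last token."""
    return [0.0] * (num_tokens - 1) + [final_reward]


def fraction_nonzero(values):
    """Sparse in value: share of entries that carry any reward."""
    return sum(1 for v in values if v != 0) / len(values)


if __name__ == "__main__":
    # Hypothetical end-of-output rewards for four sampled outputs.
    mostly_zero = [0.0, 0.0, 1.0, 0.0]  # sparse in both senses
    graded = [0.7, 0.2, 0.9, 0.4]       # sparse in time only

    print(per_token_rewards(mostly_zero[2], 5))
    print(fraction_nonzero(mostly_zero))
    print(fraction_nonzero(graded))
```

A reward can be sparse in time without being sparse in value, which is why the two of us were talking past each other.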


HeatrayEnjoyer 550 days ago

> Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1.

Did we introduce the abusive pressure of Korean educational culture to machines?
