by danielhanchen
490 days ago
Oh no no!! The trick for GRPO is that you essentially let the model "learn" how to do reasoning itself!!! The <think> tokens are optional and are only there for formatting. You could use <reasoning> or <thinking> or [reasoning], for example, in the system prompt.
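To make the "tags are just formatting" point concrete, here is a rough sketch of how the tag name could be a plain parameter shared by the system prompt and a format-checking reward. The names `make_system_prompt` and `format_reward` are made up for illustration and are not from any library, and the reward here only checks the output shape, not whether the answer is right. It assumes angle-bracket tags; a bracket style like [reasoning] would need a different pattern.

```python
import re


def make_system_prompt(tag: str = "reasoning") -> str:
    """Build a system prompt asking for reasoning inside <tag>...</tag>.

    Hypothetical helper: the tag name is arbitrary ("think", "thinking", ...).
    """
    return (
        "Respond in the following format:\n"
        f"<{tag}>\n...\n</{tag}>\n"
        "<answer>\n...\n</answer>"
    )


def format_reward(completion: str, tag: str = "reasoning") -> float:
    """Return 1.0 if the completion follows the requested layout, else 0.0.

    Hypothetical reward: it scores formatting only, so the model is free to
    work out what goes inside the reasoning block on its own.
    """
    t = re.escape(tag)
    pattern = rf"^\s*<{t}>.*?</{t}>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0


sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>4</answer>"
score_matching_tag = format_reward(sample, tag="reasoning")
score_other_tag = format_reward(sample, tag="think")
```

Swapping `tag="reasoning"` for `tag="think"` in both calls is the only change needed to switch conventions, which is the whole point: the tag is a label, not something the model needs to have seen before.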