by serialx
503 days ago
I don't think it's only using sparse rewards, because of the format rewards. The training recipe is pretty comprehensive and involves multiple stages.[1] The paper mentions that when only the RL technique is used, the output is often not suitable for reading (language mixing, etc.). That feels like an AlphaZero moment for LLMs?

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...