|
|
|
|
|
by al_th
459 days ago
|
|
This is entirely doable. I'm absolutely not versed in RL, but I wanted to understand GRPO, the RL algorithm behind Deepseek's latest model. I started from a very simple LLM, inspired from Andrej Karpathy's "GPT from scratch" video (https://www.youtube.com/watch?v=kCc8FmEb1nY). Then, I added onto that the GRPO algorithm, which in itself is very simple. I made a GitHub repo if you want to try it out : https://github.com/Al-th/grpo_experiment |
|