HN Mirror



	by anitakirkovska 512 days ago
	They were using techniques like PPO, which rely on a separate model (like a critic!) to evaluate whether the new model gives accurate responses. With GRPO, they don't have that and instead evaluate the answers based on predefined rules like coherence and formatting. For example, for math problems, these rules check whether the answer adheres to math principles or logic! I wrote more here (lmk if this is useful): https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-w...
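
	To make the "rules instead of a critic" idea concrete, here is a rough sketch of what that could look like. Everything in it is an assumption for illustration: the `<answer>` tag format, the reward values, and the function names are made up and are not DeepSeek's actual setup. The one part taken from the GRPO formulation is that each sampled answer is scored relative to the mean and standard deviation of its own group, which is what stands in for the critic's baseline.

```python
import re
import statistics

# Hypothetical output format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, expected: str) -> float:
    """Score one completion with fixed rules; no learned model is involved.

    The reward values (0.5 for format, 1.0 for correctness) are arbitrary
    choices for this sketch.
    """
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # Format rule failed, so nothing can be checked.
    reward = 0.5  # Format rule passed.
    if match.group(1).strip() == expected.strip():
        reward += 1.0  # Correctness rule passed.
    return reward


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every sample scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# A group of sampled completions for the prompt "What is 2 + 2?".
completions = [
    "Let me think. 2 + 2 is 4. <answer>4</answer>",
    "<answer>5</answer>",
    "The answer is 4",
]
rewards = [rule_based_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)
```

	Completions with an above-average reward in their group get a positive advantage and are reinforced, and below-average ones get a negative advantage, all without training a second model to estimate values.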