| HN Mirror


	by porridgeraisin 583 days ago
	You will have a hyperparameter that weights the KL divergence between the updated policy distribution and the current policy distribution. This lets you tune how far a single update is allowed to move the policy, and so how sensitive the training process is. Entropy maximization is common in offline RL specifically, as it ensures the policy keeps at least some nondeterminism and isn't bound so closely to the data you have collected that it becomes basically deterministic. This is also tunable with a weight.
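To make the two weights concrete, here is a rough sketch of how they might enter a loss for a discrete-action policy. Everything named here is an assumption for illustration: the function names, the default values of `kl_weight` and `entropy_weight`, and the KL direction (current policy relative to updated policy, as in penalty-style PPO). Real implementations differ on all of these.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def regularized_loss(policy_loss, new_probs, old_probs,
                     kl_weight=0.1, entropy_weight=0.01):
    """Hypothetical loss: base policy loss, plus a KL penalty, minus an entropy bonus.

    kl_weight      -- larger values punish moving away from the current policy
    entropy_weight -- larger values reward keeping the policy nondeterministic
    """
    kl_penalty = kl_weight * kl_divergence(old_probs, new_probs)
    entropy_bonus = entropy_weight * entropy(new_probs)
    return policy_loss + kl_penalty - entropy_bonus

# Example: the updated policy has drifted toward one action.
old = [0.5, 0.5]
new = [0.9, 0.1]
loose = regularized_loss(1.0, new, old, kl_weight=0.01)
tight = regularized_loss(1.0, new, old, kl_weight=1.0)
```

With the same drift, `tight` comes out larger than `loose`, so a minimizer is pushed harder to stay near the current policy. Raising `entropy_weight` instead lowers the loss for spread-out policies, which is what keeps the policy from collapsing to a deterministic one.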