by programjames 675 days ago

They're good for reinforcement learning. E.g., Cicero uses piKL, which samples actions according to

p ∝ anchor_policy * exp(utility / temperature)

The utility plays the same role as "energy", just with the sign flipped: high utility corresponds to low energy. The article ignores entropy, but you can add entropy regularization, e.g. as in soft actor-critic.
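
To make the formula concrete, here's a rough sketch of sampling from p ∝ anchor_policy * exp(utility / temperature) over a small discrete action set. This only illustrates the formula above and is not Cicero's actual code; the function names (`anchored_boltzmann`, `sample_action`) and the example numbers are made up for illustration.

```python
import math
import random


def anchored_boltzmann(anchor_policy, utilities, temperature):
    """Return p(a) proportional to anchor_policy[a] * exp(utilities[a] / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Shift by the max utility before exponentiating to avoid overflow;
    # the shift cancels out when we normalize.
    u_max = max(utilities)
    weights = [
        a * math.exp((u - u_max) / temperature)
        for a, u in zip(anchor_policy, utilities)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(probs, rng=random):
    """Draw one action index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical example: three actions, anchor prefers action 0,
# utility prefers action 1.
anchor = [0.5, 0.3, 0.2]
utility = [1.0, 2.0, 0.0]

probs = anchored_boltzmann(anchor, utility, temperature=1.0)
action = sample_action(probs)
```

With a very high temperature the result stays close to the anchor policy, and with a very low temperature it concentrates on the highest-utility action. With a uniform anchor it reduces to a plain softmax over utility / temperature.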