by programjames
675 days ago
They're good for reinforcement learning. E.g. Cicero uses piKL, which samples according to p ∝ anchor_policy * exp(utility / temperature). The utility plays the same role as "energy", up to sign: high utility corresponds to low energy. The article ignores entropy, but you can add entropy regularization, e.g. as in soft actor-critic.
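To make the formula concrete, here is a minimal sketch of that sampling rule for a small discrete action set. It is not Cicero's actual code: the names `pikl_policy` and `sample_action` are made up for illustration, and the anchor policy and utilities are assumed to be plain lists of floats.

```python
import math
import random


def pikl_policy(anchor_policy, utilities, temperature):
    """Return p with p[a] proportional to anchor_policy[a] * exp(utilities[a] / temperature).

    Hypothetical helper for illustration; assumes a small discrete action set.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; actions with zero anchor probability get -inf.
    logits = [
        math.log(p) + u / temperature if p > 0 else -math.inf
        for p, u in zip(anchor_policy, utilities)
    ]
    max_logit = max(logits)
    if max_logit == -math.inf:
        raise ValueError("anchor_policy must put positive mass on some action")
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(anchor_policy, utilities, temperature, rng=random):
    """Draw one action index from the distribution above."""
    probs = pikl_policy(anchor_policy, utilities, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A large temperature leaves the result close to the anchor policy, while a small one concentrates mass on the highest-utility actions that the anchor still allows.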