by currymj 2399 days ago
this is how most of the constrained MDP stuff effectively works. it's not a bad intuition to think of it as just different kinds of reward shaping.

in some approaches you write down the Lagrangian of the RL reward-maximizing problem and then the hard constraints become (perhaps infinitely strong) soft penalties.
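
to make that concrete, here is a rough sketch of the usual formulation. the symbols are my own choice for illustration, not from any particular paper: r is the reward, c is a per-step constraint cost, d is the cost budget, and \lambda is the multiplier.

```latex
% Constrained MDP (assumed notation): maximize return subject to a cost budget d
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, c(s_t, a_t)\Big] \le d

% Lagrangian with multiplier \lambda \ge 0
L(\pi, \lambda) = J_r(\pi) - \lambda \big( J_c(\pi) - d \big)

% For fixed \lambda the term \lambda d is constant, so maximizing over \pi
% is ordinary RL with the shaped reward
\tilde{r}(s, a) = r(s, a) - \lambda\, c(s, a)

% The hard constraint is recovered by the inner minimization:
% if J_c(\pi) > d, then \min_{\lambda \ge 0} L(\pi, \lambda) = -\infty
\max_{\pi} \; \min_{\lambda \ge 0} \; L(\pi, \lambda)
```

so for any fixed finite \lambda you are doing plain reward shaping with penalty weight \lambda, and the "perhaps infinitely strong" case is \lambda going to infinity on any policy that violates the budget.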