I'm not super familiar with AI/ML/RL at all, so I'm sure this is a naive question, but isn't it obvious that the answer is to just build in costs to the utility function for behaviors you want to avoid (what they seem to refer to as constrained RL in the article)? That seems both the simplest way to handle it, and most elegant in terms of mapping to the real world domain. Like are there alternate solutions that are even remotely competitive with this? I'm sure I must be oversimplifying and I assume that there's some nuance I'm missing. E.g. is this more about how you design those constraints to minimize the overall loss in learning efficiency, or something like that?
In practice, real world RL projects can often avoid these issues by simply hard-coding an avoidance
of catastrophic behaviors. For instance, an RL-based robot helicopter might be programmed to
override its policy with a hard-coded collision avoidance sequence (such as spinning its propellers to
gain altitude) whenever it’s too close to the ground. This approach works well when there are only
a few things that could go wrong, and the designers know all of them ahead of time. But as agents
become more autonomous and act in more complex domains, it may become harder and harder to
anticipate every possible catastrophic failure. The space of failure modes for an agent running a
power grid or a search-and-rescue operation could be quite large. Hard-coding against every possible
failure is unlikely to be feasible in these cases, so a more principled approach to preventing harmful
exploration seems essential. Even in simple cases like the robot helicopter, a principled approach would simplify system design and reduce the need for domain-specific engineering.
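To make the "hard-coded override" idea in that quote concrete, here is a minimal sketch of what such a wrapper might look like. The threshold, the action names, and the function name are all made up for illustration; nothing here comes from the paper or from a real flight stack.

```python
from typing import Any, Callable

# Hypothetical safety threshold, chosen only for illustration.
MIN_SAFE_ALTITUDE_M = 2.0


def shielded_action(policy: Callable[[Any], str], observation: Any, altitude_m: float) -> str:
    """Return the learned policy's action unless the hard-coded rule fires."""
    if altitude_m < MIN_SAFE_ALTITUDE_M:
        # Hard-coded collision avoidance: ignore the policy and gain altitude.
        return "climb"
    return policy(observation)
```

This only covers the one failure the designer thought of (being too close to the ground), which is exactly the limitation the quote is pointing at.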
>I think the answer is that "just building in costs" is actually rather hard to get right.
Exactly. It is almost as if we need AI to solve the problem of properly supervising AI training. I was wondering whether the solution would be to add a third network, called a supervisor, to the classic actor-critic system. The supervisor would differ from the critic in architecture, and its goal would be to avoid those "terrible" outcomes. Some experiments would have to be run to decide whether this approach is viable or whether we have to continue tweaking cost functions.
Regarding Safety Gym, I'm not sure how what they are doing differs from simply hard-coding into your training procedure a series of checks for the probability of hitting disallowed states in the next step. For example, in their case of a robotic arm trained with humans around, a hard-coded algorithm could track people around the arm's work envelope and give the robot a cost penalty whenever a person is detected approaching. Also, for this to result in trained avoidance of people, the network would need sufficient inputs to detect people by itself.
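Roughly what I have in mind for the hard-coded check, as a sketch only. The envelope radius, the 2D positions, and the function names are assumptions of mine for illustration, not anything taken from Safety Gym.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]

# Hypothetical radius of the arm's work envelope, in metres.
WORK_ENVELOPE_RADIUS_M = 1.5


def proximity_cost(arm_xy: Point, people_xy: Iterable[Point],
                   radius_m: float = WORK_ENVELOPE_RADIUS_M) -> float:
    """Return 1.0 if any tracked person is inside the work envelope, else 0.0."""
    return float(any(math.dist(arm_xy, p) < radius_m for p in people_xy))


def step_signals(task_reward: float, arm_xy: Point, people_xy: Iterable[Point]):
    """Report the cost next to the reward rather than folding it into the reward."""
    return task_reward, proximity_cost(arm_xy, people_xy)
```

The tracker computes the cost from privileged information; the policy still only avoids people if its own observations let it tell where they are.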
Yes, that seems like an important problem, but one separate to what they're describing in OP's article. (Again, assuming I'm understanding this right.) Their constrained RL approach is still relying on our ability to enumerate and assign costs to the undesirable behaviors, right? From reading the article, I get the impression that they are focused on addressing that scenario, and leaving the problem of how to enumerate all undesirable behaviors to separate research.
There are a lot of direct technical reasons this might not work (for example, not all edge cases are sufficiently sampled).
But there is also a "fundamental" issue: it is difficult, if not impossible, to enumerate "bad behaviors". This issue runs through a lot of AI safety, including AGI safety, as discussed for example in Nick Bostrom's "Superintelligence" (https://www.amazon.com/dp/B00LOOCGB2).
That works, but to learn to avoid these "bad" things in the setting you describe, the agent first has to make those mistakes and learn from them. There are mistakes we don't want the agent to make, ever. That's what safe RL is about.
The approach you describe is mentioned in the article as "normal RL". Constrained RL is an advanced mode of it where you are given direct control over how often some safety constraint is allowed to be violated. Basically, constrained RL just automates away the part where you manually adjust the "normal RL" punishments to fit your constraints.
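A toy sketch of what "automating the punishment tuning" can look like, assuming a Lagrangian-style update (one common way to do it, not necessarily what the article's algorithms do). The one-parameter "policy", the reward and cost functions, and all names are invented for illustration.

```python
def tune_penalty(cost_limit: float = 0.25, steps: int = 2000, lr: float = 0.05):
    """Toy primal-dual loop: maximise reward(a) = a subject to cost(a) = a**2 <= cost_limit."""
    a = 0.0    # stand-in for the policy parameters, kept in [0, 1]
    lam = 0.0  # penalty coefficient (Lagrange multiplier), kept >= 0
    for _ in range(steps):
        # Policy step: gradient ascent on reward - lam * cost.
        a += lr * (1.0 - 2.0 * lam * a)
        a = min(max(a, 0.0), 1.0)
        # Penalty step: raise lam while the constraint is violated, lower it otherwise.
        lam += lr * (a * a - cost_limit)
        lam = max(lam, 0.0)
    return a, lam
```

You only specify `cost_limit`; the loop finds the penalty strength itself. For this toy problem the constrained optimum is a = 0.5, which the loop should approach, instead of you hand-tuning a fixed penalty until the violation rate looks acceptable.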
The main issue there is just that if you see something in operation that is sufficiently different from training, you may still incur those penalties. E.g. you can train an arm not to hit a person in simulation by penalizing it, but that doesn't guarantee there isn't an input that would still cause the safety violation. Generalization in these regards can still be shockingly bad for modern approaches.
this is how most of the constrained MDP stuff effectively works, it’s not a bad intuition that it is just different kinds of reward shaping.
in some approaches you write down the Lagrangian of the RL reward-maximizing problem and then the hard constraints become (perhaps infinitely strong) soft penalties.
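Written out, under the usual constrained-MDP notation (the symbols here are my choice, not the article's: $J_R$ is expected return, $J_C$ is expected cost, $d$ is the cost limit):

```latex
% Constrained problem
\max_{\pi} \; J_R(\pi) \quad \text{subject to} \quad J_C(\pi) \le d

% Lagrangian relaxation
\max_{\pi} \; \min_{\lambda \ge 0} \; J_R(\pi) - \lambda \left( J_C(\pi) - d \right)
```

If the constraint is violated, $J_C(\pi) - d > 0$ and the inner minimization can push $\lambda$ arbitrarily high, which is where the "perhaps infinitely strong" penalty comes from. For a fixed finite $\lambda$ this is just the reward with a cost term subtracted, i.e. ordinary reward shaping.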
Everything about "openAI" institute seems to be designed to appeal to frightened, paranoid billionaire donors who think they need to be kept safe from near relatives to logistic regression and the remote control for their television, because muh singularity.
Can't you just call it "constrained reinforcement learning" without sexing it up for Elon? I guess not.
I like Elon just fine, but OpenAI is basically funded by billionaire donations, and wouldn't exist at all if he hadn't read dumb science fiction masquerading as modern day science fact.