


	by peripitea 2403 days ago
	I'm not super familiar with AI/ML/RL at all, so I'm sure this is a naive question, but isn't it obvious that the answer is to just build in costs to the utility function for behaviors you want to avoid (what they seem to refer to as constrained RL in the article)? That seems both the simplest way to handle it, and most elegant in terms of mapping to the real world domain. Like are there alternate solutions that are even remotely competitive with this? I'm sure I must be oversimplifying and I assume that there's some nuance I'm missing. E.g. is this more about how you design those constraints to minimize the overall loss in learning efficiency, or something like that?

7 comments

Tyr42 2402 days ago

I think the answer is that "just building in costs" is actually rather hard to get right.

Check out Concrete Problems in AI Safety (Section 6 in particular is about safe exploration):

https://arxiv.org/pdf/1606.06565.pdf

Quote:

In practice, real world RL projects can often avoid these issues by simply hard-coding an avoidance of catastrophic behaviors. For instance, an RL-based robot helicopter might be programmed to override its policy with a hard-coded collision avoidance sequence (such as spinning its propellers to gain altitude) whenever it’s too close to the ground. This approach works well when there are only a few things that could go wrong, and the designers know all of them ahead of time. But as agents become more autonomous and act in more complex domains, it may become harder and harder to anticipate every possible catastrophic failure. The space of failure modes for an agent running a power grid or a search-and-rescue operation could be quite large. Hard-coding against every possible failure is unlikely to be feasible in these cases, so a more principled approach to preventing harmful exploration seems essential. Even in simple cases like the robot helicopter, a principled approach would simplify system design and reduce the need for domain-specific engineering.
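
To make the quoted helicopter example concrete, here is a minimal sketch of a hard-coded override wrapped around a learned policy. The observation keys, the action format, and the altitude threshold are all hypothetical and chosen only for illustration; nothing here comes from the paper.

```python
from typing import Callable, Dict

# Hypothetical threshold: below this altitude the hard-coded rule takes over.
MIN_SAFE_ALTITUDE_M = 2.0

def climb_action() -> Dict[str, float]:
    """Hard-coded recovery: level out and apply full throttle to gain altitude."""
    return {"throttle": 1.0, "pitch": 0.0, "roll": 0.0}

def safe_action(policy: Callable[[Dict[str, float]], Dict[str, float]],
                obs: Dict[str, float]) -> Dict[str, float]:
    """Return the policy's action unless the hand-written safety rule fires."""
    if obs["altitude_m"] < MIN_SAFE_ALTITUDE_M:
        return climb_action()
    return policy(obs)
```

The weakness the quote points at is visible here: the wrapper only protects against the one failure its author thought to write an `if` for.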


Roark66 2402 days ago

>I think the answer is that "just building in costs" is actually rather hard to get right.

Exactly. It is almost as if we need AI to solve the problem of properly supervising AI's training. I was wondering whether the solution would be to add a third network, called a supervisor, to the classic actor-critic system. The supervisor would differ from the critic in architecture, and its goal would be to avoid those "terrible" outcomes. Some experiments would have to be run to decide whether this approach is viable or whether we have to keep tweaking cost functions.

Regarding Safety Gym, I'm not sure how what they are doing differs from simply hard-coding into your training procedure a series of checks for the probability of hitting disallowed states in the next step. For instance, in their example of a robotic arm that is trained with humans around, the hard-coded algorithm could track people around the arm's work envelope and give the robot a cost penalty whenever a person is detected approaching it. Also, for this to result in trained avoidance of people, the network would need sufficient inputs to detect people by itself.
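
A rough sketch of the kind of hard-coded check described above, written as a cost signal kept separate from the task reward. The function name, the 2D coordinates, and the envelope radius are assumptions made for illustration; this is not Safety Gym's API.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]

def proximity_cost(arm_xy: Point,
                   people_xy: Iterable[Point],
                   envelope_radius_m: float = 1.5) -> float:
    """Return 1.0 if any tracked person is inside the arm's work envelope, else 0.0."""
    inside = any(math.dist(arm_xy, person) < envelope_radius_m
                 for person in people_xy)
    return 1.0 if inside else 0.0
```

The training loop would accumulate this cost alongside the reward. As noted above, the policy can only learn to avoid the cost if its own observations let it tell where people are.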


peripitea 2402 days ago

Yes, that seems like an important problem, but one separate to what they're describing in OP's article. (Again, assuming I'm understanding this right.) Their constrained RL approach is still relying on our ability to enumerate and assign costs to the undesirable behaviors, right? From reading the article, I get the impression that they are focused on addressing that scenario, and leaving the problem of how to enumerate all undesirable behaviors to separate research.


sanxiyn 2402 days ago

Constrained RL is a way to say "thou shalt not murder", instead of saying "murder is utility -10000".


ivalm 2403 days ago

There are a lot of direct technical reasons this might not work (not all edge cases are sufficiently sampled).

But there is also a "fundamental" issue: it is difficult or impossible to enumerate "bad behaviors". This issue runs through a lot of AI safety, including AGI safety as discussed, for example, in Nick Bostrom's "Superintelligence" (https://www.amazon.com/dp/B00LOOCGB2).


jefft255 2403 days ago

That works, but to learn to avoid these "bad" things in the setting you describe, the agent first has to make those mistakes and learn from them. There are mistakes we don't want the agent to make, ever. That's what safe RL is about.


est31 2403 days ago

The approach you describe is mentioned in the article as "normal RL". Constrained RL is an advanced mode of it where you are given direct control over how often a safety constraint may be violated. Basically, constrained RL just automates away the part where you manually adjust the "normal RL" punishments to fit your constraints.
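
One common way to do that automation is a Lagrange multiplier that is raised while the constraint is violated and lowered otherwise. The sketch below shows the idea on a toy problem rather than a real RL task: the "policy" is a single number `p` (the probability of a risky action), and the reward function, cost function, cost limit, and step sizes are all made up for illustration.

```python
def reward(p: float) -> float:
    """Toy expected reward: more risk pays more, with diminishing returns."""
    return p - 0.5 * p * p

def cost(p: float) -> float:
    """Toy expected cost: proportional to how often the risky action is taken."""
    return p

def solve(cost_limit: float = 0.2, steps: int = 2000,
          lr_p: float = 0.05, lr_lam: float = 0.05):
    """Primal-dual updates on L(p, lam) = reward(p) - lam * (cost(p) - cost_limit)."""
    p, lam = 0.5, 0.0
    for _ in range(steps):
        # Policy step: gradient ascent on the penalized objective.
        # d/dp [reward(p) - lam * cost(p)] = (1 - p) - lam
        grad_p = (1.0 - p) - lam
        p = min(1.0, max(0.0, p + lr_p * grad_p))
        # Penalty step: raise lam while over the limit, lower it when under.
        lam = max(0.0, lam + lr_lam * (cost(p) - cost_limit))
    return p, lam

p, lam = solve()
print(f"p={p:.3f}, penalty weight lam={lam:.3f}")
```

You specify only `cost_limit`; the penalty weight `lam` is found by the loop instead of being tuned by hand. In this toy setup the iterates settle near `p = 0.2` and `lam = 0.8`, which is where the constraint is exactly met.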


idlewords 2402 days ago

"just" is the word 99% of the work hides behind.


TTPrograms 2403 days ago

The main issue is that if the agent sees something in operation that is sufficiently different from training, it may still commit the penalized behavior. E.g., you can train an arm not to hit a person in simulation by penalizing it, but that doesn't guarantee there isn't an input that would still cause the safety violation. Generalization in this regard can still be shockingly bad for modern approaches.


currymj 2402 days ago

This is how most of the constrained MDP stuff effectively works; it’s not a bad intuition that it is just a different kind of reward shaping.

In some approaches you write down the Lagrangian of the RL reward-maximizing problem, and then the hard constraints become (perhaps infinitely strong) soft penalties.
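
In symbols, as a sketch with notation that is assumed here rather than taken from the thread: let $J_R(\pi)$ be the expected return of policy $\pi$, $J_C(\pi)$ its expected cost, and $d$ the allowed cost budget.

```latex
\begin{aligned}
&\max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d \\
&\Longleftrightarrow \quad \max_{\pi} \, \min_{\lambda \ge 0} \;
  \Big[ J_R(\pi) - \lambda \big( J_C(\pi) - d \big) \Big]
\end{aligned}
```

If $\pi$ violates the constraint, the inner minimization sends $\lambda \to \infty$ and the objective to $-\infty$, which is the "infinitely strong" penalty. If $\pi$ is feasible, the minimum is at $\lambda = 0$ and the objective is just $J_R(\pi)$. Fixing $\lambda$ at a finite value instead gives ordinary reward shaping with per-step reward $r - \lambda c$.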
