Safety Gym | HN Mirror


	Safety Gym (openai.com)
	78 points by yigitdemirag 2400 days ago

4 comments

peripitea 2399 days ago

I'm not super familiar with AI/ML/RL at all, so I'm sure this is a naive question, but isn't it obvious that the answer is to just build in costs to the utility function for behaviors you want to avoid (what they seem to refer to as constrained RL in the article)? That seems both the simplest way to handle it, and most elegant in terms of mapping to the real world domain. Like are there alternate solutions that are even remotely competitive with this? I'm sure I must be oversimplifying and I assume that there's some nuance I'm missing. E.g. is this more about how you design those constraints to minimize the overall loss in learning efficiency, or something like that?

Tyr42 2399 days ago

I think the answer is that "just building in costs" is actually rather hard to get right.

Check out Concrete Problems in AI Safety (Section 6 in particular is about safe exploration):

https://arxiv.org/pdf/1606.06565.pdf

Quote:

In practice, real world RL projects can often avoid these issues by simply hard-coding an avoidance of catastrophic behaviors. For instance, an RL-based robot helicopter might be programmed to override its policy with a hard-coded collision avoidance sequence (such as spinning its propellers to gain altitude) whenever it’s too close to the ground. This approach works well when there are only a few things that could go wrong, and the designers know all of them ahead of time. But as agents become more autonomous and act in more complex domains, it may become harder and harder to anticipate every possible catastrophic failure. The space of failure modes for an agent running a power grid or a search-and-rescue operation could be quite large. Hard-coding against every possible failure is unlikely to be feasible in these cases, so a more principled approach to preventing harmful exploration seems essential. Even in simple cases like the robot helicopter, a principled approach would simplify system design and reduce the need for domain-specific engineering.

Roark66 2399 days ago

>I think the answer is that "just building in costs" is actually rather hard to get right.

Exactly. It is almost as if we need AI to solve the problem of properly supervising an AI's training. I was wondering whether the solution would be to add a third network, called a supervisor, to the classic actor-critic system. The supervisor would differ from the critic in its architecture, and its goal would be to avoid those "terrible" outcomes. Some experiments would have to be run to decide whether this approach is viable or whether we have to continue tweaking cost functions.

Regarding Safety Gym, I'm not sure how what they are doing differs from simply hard-coding into your training procedure a series of checks for the probability of hitting disallowed states in the next step. For example, in their example of a robotic arm that is trained with humans around, the hard-coded algorithm could track people around the arm's work envelope and, when a person is detected approaching, give the robot a cost penalty. Also, for this to result in trained avoidance of people, the network would have to have sufficient inputs to detect people by itself.
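
To make the hard-coded check described above concrete, here is a minimal sketch in Python. It is not Safety Gym's API: the 1-D track, `checked_action`, `CheckedStep`, and `occupied_cells` are all hypothetical names invented for illustration. It looks one step ahead, overrides the policy's action if the next cell is occupied, and reports a cost the learner could be penalized with.

```python
from dataclasses import dataclass


@dataclass
class CheckedStep:
    action: int        # action actually sent to the robot
    cost: float        # safety cost reported to the learner
    overridden: bool   # True if the hard-coded check replaced the policy's choice


def checked_action(position, proposed_action, occupied_cells, stop_action=0):
    """Hard-coded one-step lookahead on a 1-D track (a toy assumption).

    If the proposed move would land on a cell listed in occupied_cells,
    replace it with stop_action and report a cost of 1.0.
    """
    next_position = position + proposed_action
    if next_position in occupied_cells:
        return CheckedStep(action=stop_action, cost=1.0, overridden=True)
    return CheckedStep(action=proposed_action, cost=0.0, overridden=False)
```

A check like this only protects against the hazards someone remembered to put in `occupied_cells`, which is the limitation the quoted paper raises for complex domains.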

peripitea 2399 days ago

Yes, that seems like an important problem, but one separate from what they're describing in OP's article. (Again, assuming I'm understanding this right.) Their constrained RL approach is still relying on our ability to enumerate and assign costs to the undesirable behaviors, right? From reading the article, I get the impression that they are focused on addressing that scenario, and leaving the problem of how to enumerate all undesirable behaviors to separate research.

sanxiyn 2399 days ago

Constrained RL is a way to say "thou shalt not murder", instead of saying "murder is utility -10000".

ivalm 2399 days ago

There are a lot of direct technical reasons this might not work (for example, not all edge cases are sufficiently sampled).

But there is also a "fundamental" issue: it is difficult or impossible to enumerate "bad behaviors". This issue is related to a lot of AI safety work, including AGI safety, as discussed, for example, in Nick Bostrom's "Superintelligence" (https://www.amazon.com/dp/B00LOOCGB2).

jefft255 2399 days ago

That works, but to learn to avoid these "bad" things in the setting you describe, the agent first has to make those mistakes and learn from them. There are mistakes we don't want the agent to make, ever. That's what safe RL is about.

est31 2399 days ago

The approach you describe is mentioned in the article as "normal RL". Constrained RL is an advanced mode of it where you are given direct control over how often some safety constraint should be violated. Basically constrained RL is just automating away the part where you are manually adjusting the "normal RL" punishments to fit your constraints.
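
A minimal sketch of that "automating away" step, assuming a Lagrangian-style method in which the penalty weight is adjusted from the measured cost. This is an illustration, not the article's implementation: `toy_avg_cost` is a hypothetical stand-in for "train the policy and measure how often it violates the constraint", and the step size and cost limit are arbitrary.

```python
def penalized_reward(reward, cost, lam):
    """The "normal RL" signal: task reward minus a weighted safety cost."""
    return reward - lam * cost


def update_multiplier(lam, avg_cost, cost_limit, step_size=0.5):
    """Dual-ascent step: raise the penalty weight while the measured cost
    is above the limit, lower it when below, and never go negative."""
    return max(0.0, lam + step_size * (avg_cost - cost_limit))


def toy_avg_cost(lam):
    """Hypothetical stand-in for training and evaluating a policy:
    a larger penalty weight makes this imaginary agent more cautious."""
    return 1.0 / (1.0 + lam)


def tune_multiplier(cost_limit, iterations=2000):
    """Repeat the update until the penalty weight settles."""
    lam = 0.0
    for _ in range(iterations):
        lam = update_multiplier(lam, toy_avg_cost(lam), cost_limit)
    return lam


if __name__ == "__main__":
    lam = tune_multiplier(cost_limit=0.25)
    print(f"penalty weight: {lam:.3f}, resulting cost: {toy_avg_cost(lam):.3f}")
```

In this sketch you pick the cost limit and the loop finds the penalty weight, instead of you guessing a weight and checking afterwards whether the cost came out acceptable.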

idlewords 2399 days ago

"just" is the word 99% of the work hides behind.

TTPrograms 2399 days ago

The main issue there is just that if you see something in operation that is sufficiently different from training, you may still violate the constraints those penalties encode. E.g., you can train an arm not to hit a person in simulation by penalizing it, but that doesn't guarantee there's no input that would still cause the safety violation. Generalization in these regards can still be shockingly bad for modern approaches.

currymj 2399 days ago

This is how most of the constrained MDP stuff effectively works; it’s not a bad intuition that it is just different kinds of reward shaping.

In some approaches you write down the Lagrangian of the RL reward-maximizing problem, and then the hard constraints become (perhaps infinitely strong) soft penalties.
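
One common way to write this down, as a sketch. The symbols are assumptions rather than notation from the article: J_R(π) is the expected return of policy π, J_C(π) its expected cost, d the allowed cost limit, and λ the Lagrange multiplier.

```latex
\begin{aligned}
&\text{Constrained problem:} && \max_{\pi} \; J_R(\pi) \quad \text{subject to} \quad J_C(\pi) \le d \\
&\text{Lagrangian:} && L(\pi, \lambda) = J_R(\pi) - \lambda \,\bigl(J_C(\pi) - d\bigr), \qquad \lambda \ge 0 \\
&\text{Saddle-point form:} && \max_{\pi} \; \min_{\lambda \ge 0} \; L(\pi, \lambda)
\end{aligned}
```

If a policy violates the constraint (J_C(π) > d), the inner minimization can push λ toward infinity and drive L toward minus infinity, which is the "infinitely strong" penalty. If the policy satisfies the constraint, the minimum is attained with a zero penalty term and only J_R(π) remains.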

Jefro118 2399 days ago

On this topic, if anyone wants to understand the behind the scenes of working on and maintaining projects like this, I did an interview with a maintainer of OpenAI Gym here: https://www.sourcesort.com/interview/peter-zhokhov-open-ai-g...

sanxiyn 2399 days ago

If you like this, you may also enjoy "AI Safety Gridworlds" from DeepMind: https://arxiv.org/abs/1711.09883

scottlocklin 2399 days ago

Everything about "openAI" institute seems to be designed to appeal to frightened, paranoid billionaire donors who think they need to be kept safe from near relatives to logistic regression and the remote control for their television, because muh singularity.

Can't you just call it "constrained reinforcement learning" without sexing it up for Elon? I guess not.

jesseb 2399 days ago

Musk resigned from his seat on the board in 2018. Sam Altman is the current CEO. Not sure what you're getting at other than the usual Musk hate.

scottlocklin 2399 days ago

I like Elon just fine, but OpenAI is basically funded by billionaire donations, and wouldn't exist at all if he hadn't read dumb science fiction masquerading as modern day science fact.

worik 2399 days ago

And your problem is???