by robbensinger 3846 days ago
The problems aren't completely unprecedented (else we'd have basically no knowledge about them), but they become more severe in the scenarios Bostrom/Russell/etc. are talking about.

I would say the central concern is with notional systems that can form detailed, accurate models of the world and efficiently search the space of policies for ones the model predicts will produce a given outcome. Such a system can be a recommender that tells other agents what policies to adopt, or it can execute the policies itself.
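To make that description concrete, here is a toy sketch of model-based policy search. Everything in it is an assumption made up for illustration (the one-dimensional world, the names `model`, `search`, `recommend`, `execute`, and the exhaustive search), not a description of any real system:

```python
from itertools import product

# Hypothetical toy world: the state is an integer position on a line,
# and each action moves the position by -1, 0, or +1.
ACTIONS = (-1, 0, 1)


def model(state, action):
    """Toy world model: predicts the next state for a given action."""
    return state + action


def predict(state, policy):
    """Roll the model forward over a policy (a fixed sequence of actions)."""
    for action in policy:
        state = model(state, action)
    return state


def search(start, goal, horizon):
    """Exhaustively search all policies of length `horizon` and return the
    one whose model-predicted end state is closest to `goal`."""
    return min(
        product(ACTIONS, repeat=horizon),
        key=lambda policy: abs(predict(start, policy) - goal),
    )


def recommend(start, goal, horizon):
    """Recommender mode: hand the chosen policy to some other agent."""
    return search(start, goal, horizon)


def execute(start, goal, horizon, step):
    """Executor mode: carry out the chosen policy using `step`, the
    transition function of the actual environment (which may differ
    from the model)."""
    state = start
    for action in search(start, goal, horizon):
        state = step(state, action)
    return state
```

For example, `recommend(0, 3, 3)` returns the policy `(1, 1, 1)`, and `execute(0, 3, 3, model)` ends in state 3 when the environment matches the model. Pass a `step` that differs from `model` and the same policy lands somewhere the search never evaluated.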

If the search through policies is sufficiently counter-intuitive and opaque to operator inspection, the "Sorcerer's Apprentice" problem becomes much more severe than it is in ordinary software. As the system becomes more capable, it can look increasingly safe and useful in its current context and yet remain brittle in the face of changes to itself and its environment. This is also where convergent instrumental goals become more concerning: systems whose policy selection criteria are imperfectly understood or designed (which, from our perspective, introduces an element of randomness) seem likely to converge on adversarial policies, simply because resources are limited.

There's no reason to think this kind of system is inevitable, but it's worth investigating how likely we are to be able to develop superhuman planning/decision agents, on what timescale, and whether there are any actions we could take in advance to make it possible to use such systems safely. At this point not enough research-hours have gone into this topic to justify any strong conclusions about whether we can (or can't) make much progress today.

http://givewell.org/labs/causes/ai-risk gives a good summary of this topic.