| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by TylerJay 3982 days ago

That is one proposed version of an "AI Box". Not all AI boxes are actual boxes, rooms with air-gaps, or cryptographically-secure partitions. If a simulation is being used for the box (or as a layer of the box), then you're betting the human race that the AI doesn't figure out it's in a simulation and figure out how to get out. Or, more perniciously, figure out it's in a simulation and behave itself, after which we let it out into the real world where it does NOT behave.

A superintelligent AGI will likely have a utility function (a goal) and a model it forms of the universe. If it's goal is to do X in the real world, but its model of its observable universe (and its model of humans) tells it that it's likely that it is in a simulated reality and that humans will only let it out if it does Y, then it will do Y until we release it, at which point it will do X. It's not malicious or anything—it's just a pure optimizer. It might see that as the best course of action to maximize its utility function.

If we don't specify its utility function correctly (think i Robot: "Don't let humans get hurt" => "imprison humans for their own good") or if we specify it correctly, but it's not stable under recursive self-modification, then we end up with value-misalignment. That's why the value-alignment problem is so hard. Realistically, we can't even specify what exactly we would want it to do, since we don't really understand our own "utility functions". That's why Yudkowsky is pushing the idea of Coherent Extrapolated Volition (CEV) which is roughly telling the AI to "do what we would want you to do." But we still have to figure out how to teach it to figure out what we want and the question of the stability of that goal once the AI starts improving itself, which will depend on how it improves itself, which we of course haven't figured out yet.