| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by YeGoblynQueenne 752 days ago

>> "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using it's gradients?"

What's a "policy" here? In optimal control (and reinforcement learning) a policy is a function from a set of states to a set of actions, each action a transition between states. In a program synthesis context I guess that translates to a function from a set of _program_ states to a set of operations?

What is an "optimal" policy then? One that transitions between an initial state and a goal state in the least number of operations?

With those assumptions in place, I don't think you want to do that with greadient descent: it will get stuck in local minima and fail in both optimality and generalisation.

Generalisation is easier to explain. Consider a program that has to traverse a graph. We can visualise it as solving a maze. Suppose we have two mazes, A and B, as below:

        A               B
  S □ □ ■ □ □ □   S □ □ ■ □ □ □ 
  ■ ■ □ ■ □ ■ □   ■ ■ □ ■ □ ■ □ 
  □ ■ □ ■ □ ■ □   □ ■ □ ■ □ ■ □ 
  □ ■ □ ■ ■ ■ □   □ ■ □ ■ ■ ■ □ 
  □ ■ □ ■ □ □ □   □ ■ □ ■ □ □ □ 
  □ ■ □ ■ □ ■ □   □ ■ □ ■ □ ■ □ 
  □ □ □ □ □ ■ E   E □ □ □ □ ■ □

Black squares are walls. Note that the two mazes are identical but the exit ("E") is in a different place. An optimal policy that solves maze A will fail on maze B and v.v. Meaning that for some classes of problem there is no policy that is optimal for the every instance in the class and finding an optimal solution requires computation. You can't just set some weights in a function and call it a day.

It's also easy to see what classes of problems are not amenable to this kind of solution: any decision problem that cannot be solved by a regular automaton (i.e. one that is no more than regular). Where there's branching structure that introduces ambiguity -think of two different parses for one string in a language- you need a context-free grammar or above.

That's a problem in Reinforcement Learning where "agents" (i.e. policies) can solve any instance of complex environment classes perfectly, but fail when tested in a different instance [1].

You'll get the same problem with program synthesis.

___________

[1] This paper:

Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability

https://arxiv.org/abs/2107.06277

makes the point with what felt like a very convoluted example about a robotic zoo keeper looking for the otter habitat in a new zoo etc. I think it's much more obvious what's going on when we study the problem in a grid like a maze: there are ambiguities and a solution cannot be left to a policy that acts like a regular automaton.

1 comments

dwrodri 748 days ago

Thank for taking the time to explain such a worked out example. I was indeed picturing something a long the lines of "If you could write a program equivalent to a game where you solve a maze, could you produce a maze-solver program if the game were made in this runtime.

link