| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by YeGoblynQueenne 2391 days ago

>> You have to "learn to predict" the <next frame, action> pairs.

But where do the actions come from?

For example, if I play chess, I could pick up a piece and throw it at my opponent's head. Similarly, if I play Atari I could chuck the controller at the monitor. These are actions I can perform that are available to me because of my basic human anatomy and because of the laws of physics (I can grab and throw and a thrown object flies through the air untl it hits a target or gravity wins).

In the case of MuZero, what actions can the system perform and where do they come from? I don't see where that is described in the paper.

>> For Go, and chess, we twist our self into NOT using the game rules in the simulator e.g. for each move, just indicate if GAME LOSS WIN

Similarly - what determines "each move"?

EDIT: I can see in the MuZero paper that "Final outcomes {lose,draw,win} in board games are treated as rewards $u_t \in {-1,0,+1}$ occurring at the final step of the episode" but I also can't see where these come from, what tells the model that a loss, draw or win has occurred at the end of an episode.

I mean, if you're telling the model what actions can be performed and what end-states values are, then what game rules are you _not_ giving to the system?

1 comments

yazr 2391 days ago

As someone patiently explained to me 2 yrs ago...

For the ATARI, the "real world" is the present frame, and a fixed set of 4 buttons and 4 directions. This of course is the game pre-programmed into the ALE ROM.

You can take any action, and get the next frame. but you cant "undo" an action, and you cant restart a game from a fixed state (see the Go-Explore controversy). And you cant explore 4 different actions in an interesting frame.

So now, if you learn a network which predicts the next frame, you can enter the world of model-based learning, where we do a simulated move tree roll-out (i.e. not calling the ATARI), try a gazillions moves, and only then select an action and get the next sample.

In a formally defined synthetic domain such as chess or logic programming, it is not clear whether this is helpful. We are simply trading one cpu time (calling the environment) for other cpu time (running our own learned im-precise model of the environment)

Of course DM has a chess function which does codes the rules of the next move. It can return a LOSS if you try an illegal move. But this function is NOT called for the tree roll out.

link

YeGoblynQueenne 2391 days ago

Thanks for your patience but this is still confusing. It's clear from your explanation that the moves and end-game states are given at the start of learning (now that you mention it, I remember the bit about illegal actions leading to a game loss). So training does not start from scratch without knowing anything about the game. The system knows what moves are _legal_ (not just possible) and it knows when the game ends, and how to score it. I don't see how this supports the claim of "no rules".

I appreciate that someone explaiend this to you at some point but I'm going with what I've read in some of the published papers and the ones I've read really leave a lot to the imagination. That is no way to present and support such big claims as "no rules", "no hands", especially when this is the central claim in a paper. Why fudge this so much when it's such an important aspect of the whole contribution? [1] You (general you) make a claim? Support the claim.

I didn't get what you mean about logic programming? Where does that come in?

________________

[1] Oh, I know why. It's the whole silly game with machine learning publications where they never tell you everything and you have to figure it out yourself. Well I like to play the other game, where I call bullshit unless it's explained clearly. In the paper. Not on Twitter and not by kind colleagues.

Silly games don't advance the science though.

link

YeGoblynQueenne 2391 days ago

>> Of course DM has a chess function which does codes the rules of the next move. It can return a LOSS if you try an illegal move. But this function is NOT called for the tree roll out.

I see what you mean- the chess function computes the results of actions returned by the system. But, if you do rollouts you need to have a set of actions from which to choose and an internal representation of states resulting from those actions. MuZero learns to predict those actions and states- but that means it selects from sets of possible actions and states. The paper does not explain where do these sets come from.

For ATARI I get it, there's the physical ish controls and video frames. For the board games however, I remember very clearly from the AlphaZero paper that there was an encoding of "knight moves" and "queen moves". I also remember less clearly that the structure of the network's layers mirrored the layout of a chessboard. That's what I mean by hard-coding and in the MuZero paper there are many references to reusing the AlphaZero archietecture and no explanation of how the same components (board states, moves) are represented in MuZero.

link