Hacker News new | ask | show | jobs
by gwern 2394 days ago
He's also wrong about a lot. For all his insistence on accuracy, he himself is misleading or ignorant. Like his slam of the Dartmouth project - even if you knew nothing about it, all you have to do is click through to see the claims of 'solving vision' are sheer projection and urban legend. And he's happy to make up claims out of whole cloth: for example, when he says "AlphaGo works fine on a 19x19 board, but would need to be retrained to play on a rectangular board; the lack of transfer is telling.", for which he provides precisely zero evidence, he ignores the fact that AG training works fine on a mixture of board sizes and such progressive growing/curriculum training in fact seems to accelerate training (https://arxiv.org/abs/1902.10565) and that rectangular convolutions are a thing that exist, and there would be plenty of transfer if anyone tried. If no one has tried that exact thing, it's because it would be pointless and it's obvious to everyone not named 'Marcus' that it'd work fine, being rectangular doesn't mystically stop it from working anymore than using a 13x13 rather than 19x19 board makes AG-style training stop working. (This is not the first time Marcus has claimed that something didn't work; I first realized that he doesn't actually keep up with the AI literature when I pointed out to him on Twitter that plenty of knowledge graphs were in use combined with DL, and Google Search was the biggest example of this, and he had no idea what I was talking about.)

No matter what DL or DRL does, or how little his own preferred paradigm does, Marcus will never ever admit anything. AlphaGo beats humans? Well, it just copied humans. AlphaZero learns from scratch? Well, he wrote a whole paper explaining how akshully it still copies humans because the tree search encodes the rules. MuZero throws out even the tree search's knowledge of rules? Crickets and essays about 'misinformation'.

2 comments

Gary Marcus' tone amounts to base trolling and it's a shame that's how he chooses to carry out his criticism. I understand your frustration.

Regarding the lack of transfer, yes, AlphaGo, AlphaZero and most of their variants have boards of fixed size and shape hard-coded in their architecture (as they have the types of piece moves-hard coded) and need architectural modifications and re-training before they can play on different boards or with different pieces (e.g. AlphaGo can't play Chess and Shoggi unmodified). The KataGo paper (the paper you linked) is one exception to this. Personally, I don't know others. Anyway general game-playing is a hard task and nobody claims it's solved by AlphaGo.

Regarding KataGo its main contribution is a significant reduction to the cost of training an AlpahGo variant while maintaining a competitive performance. This is very promising- after DeepBlue, creating a chess engine became cheaper and cheaper until they could run on a smartphone. We are far from that with Go computer players.

However, in the KataGo paper, major gains are claimed to come from a) game-playing specific or MCTS-specific improvements (playout cap randomisation, forced playouts and policy target pruning) or architecture-specific improvements (global pooling) or, b) domain-specific improvements (auxiliary ownership and score targets). Finally, KataGo has a few game-specific features (liberties, pass-alive regions and ladder features).

The KataGo paper itself says it very clearly. I quote, snipping for brevity:

Second, our work serves as a case study that there is still a significant efficiency gap between AlphaZero's methods and what is possible from self-play. We find nontrivial further gains from some domain-specific methods (...) We also find that a set of standard game-specific input features still significantly accelerates learning, showing that AlphaZero does not yet obsolete even simple additional tuning.

Finally, "it would obviously work so nobody tried" would make sense if it wasn't for the extremely competitive nature of machine learning research where every novel result is presented as a big breakthrough. Also, if something is obvious but never seems to make it to publication the chances are someone has tried and it didn't work as expected so they shelved the paper. We all know what happens to negative results in machine learning.

> The KataGo paper (the paper you linked) is one exception to this.

And my point is that KataGo shows that if you make the relatively minor architectural changes necessary to do this at all, it works just fine. None of the other tweaks it makes, useful as they may be, have anything to do with fixing transfer learning, because there's nothing to fix. It's a pretty absurd claim to claim that a CNN which works fine on 19x19 will suddenly collapse and show no transfer on, say, 17x17, and KataGo demonstrates that this does not happen.

> Also, if something is obvious but never seems to make it to publication the chances are someone has tried and it didn't work as expected so they shelved the paper.

'What if Go but rectangular boards' is pretty dumb when you have chess and shogi and other domains showing that A0 works, so I feel confident that no one like DM seriously tried and simply buried their failures. (Publication bias requires there to be a literature that can be differentially published, and competition presumes the existence of >0 entities competing; there is no active field of 'rectangular Go'.)

Regarding MuZero- I confess to not have read the paper very carefully, but I am confused by its claim that the new system achieves superhuman performance without knowing the game rules.

Specifically, MuZero uses MCTS and MCTS needs to have at the very least a move generator in order to produce actions that can then be evaluated for their results. The trained MuZero model learns the transition function and evaluation function but I don't see in the paper where it learns what actions are legal in the domain. And I don't understand how any architecture could model the possible moves in a game without observing examples of external play (i.e. not self-play).

MuZero reuses the AlphaZero architecture so most likely the moves of the pieces for Chess, Shoggi and Go are hard-coded in the architecture, as they are in AlphaZero. There's also probably some similar hard-coding of Atari actions, which I'm probably missing in the paper.

> but I am confused by its claim ... without knowing the game rules.

> probably some similar hard-coding of Atari actions

Nope, no hard coding.

Consider trying to MCTS on an Atari game. You have to "learn to predict" the <next frame, action> pairs. Initially this guess is very bad, but eventually your predictions are good enough that rolling out a tree of predictions improves your action selection

For Go, and chess, we twist our self into NOT using the game rules in the simulator e.g. for each move, just indicate if GAME LOSS WIN

Whether this paper worthy of a new Nature hype cycle is a separate debate

>> You have to "learn to predict" the <next frame, action> pairs.

But where do the actions come from?

For example, if I play chess, I could pick up a piece and throw it at my opponent's head. Similarly, if I play Atari I could chuck the controller at the monitor. These are actions I can perform that are available to me because of my basic human anatomy and because of the laws of physics (I can grab and throw and a thrown object flies through the air untl it hits a target or gravity wins).

In the case of MuZero, what actions can the system perform and where do they come from? I don't see where that is described in the paper.

>> For Go, and chess, we twist our self into NOT using the game rules in the simulator e.g. for each move, just indicate if GAME LOSS WIN

Similarly - what determines "each move"?

EDIT: I can see in the MuZero paper that "Final outcomes {lose,draw,win} in board games are treated as rewards $u_t \in {-1,0,+1}$ occurring at the final step of the episode" but I also can't see where these come from, what tells the model that a loss, draw or win has occurred at the end of an episode.

I mean, if you're telling the model what actions can be performed and what end-states values are, then what game rules are you _not_ giving to the system?

As someone patiently explained to me 2 yrs ago...

For the ATARI, the "real world" is the present frame, and a fixed set of 4 buttons and 4 directions. This of course is the game pre-programmed into the ALE ROM.

You can take any action, and get the next frame. but you cant "undo" an action, and you cant restart a game from a fixed state (see the Go-Explore controversy). And you cant explore 4 different actions in an interesting frame.

So now, if you learn a network which predicts the next frame, you can enter the world of model-based learning, where we do a simulated move tree roll-out (i.e. not calling the ATARI), try a gazillions moves, and only then select an action and get the next sample.

In a formally defined synthetic domain such as chess or logic programming, it is not clear whether this is helpful. We are simply trading one cpu time (calling the environment) for other cpu time (running our own learned im-precise model of the environment)

Of course DM has a chess function which does codes the rules of the next move. It can return a LOSS if you try an illegal move. But this function is NOT called for the tree roll out.

Thanks for your patience but this is still confusing. It's clear from your explanation that the moves and end-game states are given at the start of learning (now that you mention it, I remember the bit about illegal actions leading to a game loss). So training does not start from scratch without knowing anything about the game. The system knows what moves are _legal_ (not just possible) and it knows when the game ends, and how to score it. I don't see how this supports the claim of "no rules".

I appreciate that someone explaiend this to you at some point but I'm going with what I've read in some of the published papers and the ones I've read really leave a lot to the imagination. That is no way to present and support such big claims as "no rules", "no hands", especially when this is the central claim in a paper. Why fudge this so much when it's such an important aspect of the whole contribution? [1] You (general you) make a claim? Support the claim.

I didn't get what you mean about logic programming? Where does that come in?

________________

[1] Oh, I know why. It's the whole silly game with machine learning publications where they never tell you everything and you have to figure it out yourself. Well I like to play the other game, where I call bullshit unless it's explained clearly. In the paper. Not on Twitter and not by kind colleagues.

Silly games don't advance the science though.

>> Of course DM has a chess function which does codes the rules of the next move. It can return a LOSS if you try an illegal move. But this function is NOT called for the tree roll out.

I see what you mean- the chess function computes the results of actions returned by the system. But, if you do rollouts you need to have a set of actions from which to choose and an internal representation of states resulting from those actions. MuZero learns to predict those actions and states- but that means it selects from sets of possible actions and states. The paper does not explain where do these sets come from.

For ATARI I get it, there's the physical ish controls and video frames. For the board games however, I remember very clearly from the AlphaZero paper that there was an encoding of "knight moves" and "queen moves". I also remember less clearly that the structure of the network's layers mirrored the layout of a chessboard. That's what I mean by hard-coding and in the MuZero paper there are many references to reusing the AlphaZero archietecture and no explanation of how the same components (board states, moves) are represented in MuZero.

> Specifically, MuZero uses MCTS and MCTS needs to have at the very least a move generator in order to produce actions that can then be evaluated for their results.

You are confusing the two phases. The MuZero training does not use MCTS, it merely observes sequences of moves/states/rewards. This can be done using observations from anywhere: human games, AG games, A0 games, random games. This is where it does the actual learning of what moves are valid and what makes moves good (because invalid moves will not be represented in the dataset of valid games). It does not need MCTS or any access to an oracle about move validity, which is Marcus's complaint. This is no more cheating than observing the real world to infer its physics.

The second phase, where new games are generated, may use MCTS. But it doesn't have to. So it can learn by simply training on a game corpus, and then generating a new game corpus by self-play using only its internal implicit tree search and something like illegal moves = instant loss. It will rapidly learn to not make illegal moves and play just as validly as a MCTS-structured tree search, and then its implicit learned tree search achieves the same or greater playing strength.

>> The MuZero training does not use MCTS, it merely observes sequences of moves/states/rewards.

I'm sorry, I read the paper a bit more carefully since we're discussing it and I don't think this is right. It's true that it's a while since I read the AlphaZero paper and the details are a bit fuzzy in my memory, but in the MuZero paper it's clear that MCTS is used to generate a policy and estimated value for a current hidden state, and to select an action to take at the current real game state (the "environment"), then the observed state and reward are later reused as past observations to train the model, together with future actions, also selected by MCTS. So it seems to me that MCTS is pretty central to the training process.

The paper does say that any MDP could be used in place of MCTS but I don't think anyone seriously plans on using something else than MCTS for board games in the foreseeable future.

I'm confused by your use of the term "implicit tree search". Could you clarify?