
by jonath_laurent 2188 days ago

What do you mean by WDL output head?

So far, the main idea I have taken from the Lc0 crowd is indeed to use a prior temperature. The next thing I plan to add is the ability to batch inference requests across game simulations instead of relying on asynchronous MCTS. In your blog series, you anticipate the problem of virtual loss introducing some exploration bias in the search, but ultimately conclude that it does not change much:

[Citation from your blog series]: "Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use."

Interestingly, it seems that the Lc0 team had a different experience here. I ran some tests myself, and going from 32 to 4 workers (for 600 MCTS simulations per turn) on my connect-four agent results in a significant increase in performance. This may be because I use a much smaller neural network than yours, which is ultimately not as strong.
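For reference, batching inference requests across game simulations could look roughly like the sketch below. It assumes each game is written as a generator that yields a state whenever it needs a network evaluation and receives the value back; `run_lockstep`, `toy_simulation`, and `fake_network` are hypothetical names, and the "network" is a stand-in function.

```python
def run_lockstep(simulations, batch_evaluate):
    """Advance all simulations together, evaluating one batch per round.

    Returns the size of each batch sent to the evaluator.
    """
    pending = {}
    for i, sim in enumerate(simulations):
        try:
            pending[i] = next(sim)  # run until the first evaluation request
        except StopIteration:
            pass

    batch_sizes = []
    while pending:
        ids = list(pending)
        values = batch_evaluate([pending[i] for i in ids])
        batch_sizes.append(len(ids))
        for i, value in zip(ids, values):
            try:
                pending[i] = simulations[i].send(value)  # resume with its value
            except StopIteration:
                del pending[i]  # this simulation is finished
    return batch_sizes


def toy_simulation(start, steps, log):
    # Stand-in for one game: requests `steps` evaluations in sequence.
    state = start
    for _ in range(steps):
        value = yield state
        log.append((state, value))
        state += 1


def fake_network(states):
    # Stand-in for a batched neural network call.
    return [0.1 * s for s in states]


log = []
games = [toy_simulation(0, 2, log), toy_simulation(10, 3, log)]
batch_sizes = run_lockstep(games, fake_network)
```

Since each game owns its search tree and only one request per game is outstanding at a time, no virtual loss is needed in this scheme; the batch size is bounded by the number of games still running.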

Related to this, there is a question I have wanted to ask you since I found your blog series: did you run experiments with smaller networks, and what were the results? What is the smallest architecture you tried, and how did it perform?

2 comments

vishvananda 2188 days ago

The Lc0 group has switched the result prediction to output win, loss, and draw probabilities instead of just win/loss. More information can be found at https://lczero.org/blog/2020/04/wdl-head/
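As a rough illustration (not Lc0's actual code), a head of this kind can be read as three logits passed through a softmax, with a scalar value recovered as P(win) - P(loss) when the search needs one. The function names and the win/draw/loss ordering below are assumptions.

```python
import math


def wdl_from_logits(logits):
    # Numerically stable softmax over (win, draw, loss) logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return tuple(e / total for e in exps)


def expected_value(wdl):
    # Scalar value in [-1, 1]: draws contribute zero.
    win, draw, loss = wdl
    return win - loss


uniform = wdl_from_logits([0.0, 0.0, 0.0])
value = expected_value((0.5, 0.3, 0.2))
```

Two positions with the same expected value can still differ in how likely a draw is, which is information a single scalar output cannot express.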


vishvananda 2188 days ago

We did a lot of our early experimentation with small networks. I don't think we went any smaller than 5 layers of 64 filters, as we mentioned here: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5...


jonath_laurent 2188 days ago

And what were the results of these experiments? For example, what error rate can you reach with the smallest network architecture you tried?


vishvananda 2188 days ago

Unfortunately, I don't remember the exact numbers, but I think it was a couple of percentage points worse than what we were able to get with the large models.


jonath_laurent 2188 days ago

This is interesting, thanks! Is there anything else you can tell me about the results of your experiments with small networks? I am really interested in this.

For example: did you notice that increasing or decreasing the network size required significant changes in other hyperparameters? Do small networks learn faster at the beginning of training before they start to plateau?
