by jonath_laurent 2184 days ago
Author here: I am happy to answer any question you may have about AlphaZero.jl. :-)
6 comments

I am confused about the FAST part: is it faster than all the other implementations (some of which are in C++), or is it just a Julia implementation that you think is fast? I am asking because if Julia is faster than C++ for ML/DL, I would prefer to use it for production use cases.
This needs clarification indeed. As I explain in the documentation, the aim of AlphaZero.jl is not to compete with hyper-specialized and hyper-optimized implementations such as LC0 or ELF OpenGO. These implementations are written in C++ with custom CUDA kernels and they are optimized for highly distributed computing environments. They are also very complex and therefore pretty inaccessible to students and researchers.

The philosophy of AlphaZero.jl is to provide an implementation of AlphaZero that is simple enough to be widely accessible for students and researchers, while also being sufficiently powerful and fast to enable meaningful experiments on limited computing resources. It has the simplicity of the many existing Python implementations, while being consistently between one and two orders of magnitude faster.

More generally, the AlphaZero algorithm is extremely general and I think it can find applications in many research domains (including automated theorem proving, which is my own research area). I have been surprised to see that, despite the general excitement around AlphaZero, very few people actually tried to build on it. One explanation, I think, is the lack of accessible open-source implementations. I am trying to bridge this gap with AlphaZero.jl.

Always great to see someone who finds something broken, and fixes it for others to move forward.

Thank you.

In addition to the excellent answer by the author below, I'd like to say that Julia can get within spitting distance of (or even sometimes exceed) C++ speeds (and even BLAS). So if a comparable amount of work went into optimizing specific paths through generic code (or even the generic code itself), it could be as fast. Also, one can write CUDA kernels in pure Julia.
Looks like I need to start learning Julia.
Nice work on this! I was behind the implementation at Oracle which you referenced in the tutorial. I still keep tabs on the Lc0 crowd, which seems to be pushing into new ideas. Did you pull anything else from the Leela crowd besides prior temperature? It looks like maybe you also tried a WLD output head?
What do you mean by WLD output head?

So far, the main idea I have pulled from the Lc0 crowd is indeed to have a prior temperature. The next thing I am planning to add is the possibility to batch inference requests across game simulations instead of relying on asynchronous MCTS. In your blog series, you anticipate the problem of the virtual loss introducing some exploration bias in the search but ultimately conclude that it does not change much:

[Citation from your blog series]: "Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use."

Interestingly, it seems that the Lc0 team had a different experience here. I myself ran some tests, and going from 32 to 4 workers (for 600 MCTS simulations per turn) on my connect-four agent results in a significant increase in performance. This may be due to the fact that I use a much smaller neural network than yours, which is ultimately not as strong.
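
To make the virtual-loss discussion above concrete, here is a minimal sketch of how a virtual loss can be folded into PUCT move selection when several workers share one search tree. It is not taken from AlphaZero.jl, Lc0, or the Oracle implementation: the `Edge` class, the `VIRTUAL_LOSS` and `C_PUCT` constants, and the function names are all hypothetical and chosen for illustration only.

```python
import math
from dataclasses import dataclass

VIRTUAL_LOSS = 1.0  # assumed penalty per pending simulation (hypothetical value)
C_PUCT = 1.5        # assumed exploration constant (hypothetical value)

@dataclass
class Edge:
    prior: float            # P(s, a) given by the network
    visits: int = 0         # number of completed simulations through this edge
    value_sum: float = 0.0  # sum of values backed up through this edge
    in_flight: int = 0      # simulations selected but not yet backed up

def puct_score(edge, parent_visits):
    # Pending simulations are counted as visits that each returned a loss.
    n = edge.visits + edge.in_flight
    w = edge.value_sum - VIRTUAL_LOSS * edge.in_flight
    q = w / n if n > 0 else 0.0
    u = C_PUCT * edge.prior * math.sqrt(parent_visits) / (1 + n)
    return q + u

def select(edges):
    # Pick the best action and mark it as in flight for other workers.
    total = sum(e.visits + e.in_flight for e in edges.values())
    parent_visits = max(total, 1)
    action = max(edges, key=lambda a: puct_score(edges[a], parent_visits))
    edges[action].in_flight += 1
    return action

def backup(edges, action, value):
    # Replace the virtual loss with the real simulation result.
    edge = edges[action]
    edge.in_flight -= 1
    edge.visits += 1
    edge.value_sum += value
```

With priors of 0.6 and 0.4 on two unvisited moves, two consecutive calls to `select` pick different moves, because the first pending simulation temporarily makes its move look like a loss. The more workers run concurrently, the more the search is pushed away from the moves it would otherwise prefer, which is the exploration bias discussed above.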

Related to this, there is a question I have wanted to ask you since I found your blog article series: did you make experiments with smaller networks and what were the results? What is the smallest architecture you tried and how did it perform?
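
For readers unfamiliar with the prior temperature mentioned earlier in this comment, one common way to apply a temperature to a policy prior is to raise each probability to the power 1/T and renormalize. The sketch below assumes that formulation; the function name is hypothetical, and neither AlphaZero.jl nor Lc0 is claimed to implement it exactly this way.

```python
def apply_prior_temperature(priors, temperature):
    """Rescale a probability vector as p_i ** (1 / T), then renormalize.

    T > 1 flattens the prior (more exploration); T < 1 sharpens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in priors]
    total = sum(scaled)
    return [s / total for s in scaled]
```

For example, a prior of `[0.8, 0.2]` with `temperature=2.0` is flattened to roughly `[0.667, 0.333]`, while `temperature=1.0` leaves it unchanged.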

The Lc0 group has switched the result prediction to predict win, loss, and draw probabilities instead of just win/loss. Some information can be found in https://lczero.org/blog/2020/04/wdl-head/
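
As a rough illustration of what such an output head computes, the sketch below applies a softmax to three raw network outputs and collapses the result into a scalar value. The output ordering (win, draw, loss), the function names, and the scalar conversion are assumptions made for this example, not details taken from the Lc0 code.

```python
import math

def wdl_probabilities(logits):
    """Softmax over three raw outputs, assumed ordered (win, draw, loss)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(wdl):
    """Scalar value in [-1, 1]: a win counts +1, a draw 0, a loss -1."""
    win, _draw, loss = wdl
    return win - loss
```

A scalar value head cannot tell a certain draw from a position that is won half the time and lost half the time, since both evaluate to 0. The three-way output keeps that distinction available.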
We did a lot of our early experimentation with small networks. I don't think we went any smaller than 5 layers of 64 filters as we mentioned here: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5...
And what were the results of these experiments? What error rate can you reach with the smallest network architecture you tried for example?
Unfortunately I don't remember the exact numbers, but I think it was a couple percentage points worse than we were able to get with the large models.
This is interesting, thanks! Is there anything else you can tell me about the results of your experiments with small networks? I am really interested in this.

For example: did you notice that increasing or decreasing network size required significant changes in other hyperparameters? Are small networks learning faster at the beginning of training before they start to plateau?

Hi, thanks for this great project.

Connect Four was used as a demonstration. I presume this is because it's much easier/cheaper to train a Connect Four AI, compared to Go?

Yes. Go 19x19 would be completely intractable on a single machine (one comment cites a $25 million estimate for the computing cost of training AlphaGo Zero). A more reasonable target would be Go 9x9 but even this would be an extreme challenge on a single machine.

There is an Oracle blog article series about training a close-to-perfect Connect Four player using AlphaZero. Even here, they had to rely on multiple GPUs.

You have to keep in mind that AlphaZero is an extremely sample-inefficient learning technique, even for simple problems. Rather, the strengths of this algorithm are that 1) it is pretty generic and 2) it can leverage huge amounts of computation.

Do you have any thoughts about multi GPU training? I haven't seen many options for Flux previously, but didn't dig very much.
Multi-GPU support definitely belongs on the TODO list. However, I am currently limited by the state of CUDA.jl on this, as it does not have a device-aware memory pool yet.

I am also looking forward to CUDA.jl supporting f16 and int8 computations, which may enable another big speedup.

No questions. Just wanted to thank you for sharing. People like you make the world better one tiny bit at a time.
Thanks for your kind message.
This is great! But what are your thoughts about MuZero? :)
I think that MuZero is a fascinating algorithm, but that a lot of news articles are misleading when they present it as a new, superior substitute for AlphaZero.

MuZero is solving a harder problem, in which the learning agent does not have a model of the environment from the start (e.g. it does not know the rules of the game a priori). This makes it potentially applicable to a larger number of real-world challenges.

However, I haven't seen any evidence that it is any better than AlphaZero at learning games such as Chess or Go. Although DeepMind reports that their MuZero agent "slightly exceeds the performances of AlphaZero on Go", they say nothing about the training time and tuning effort spent on each.

As far as I understand and in the absence of further data, I think AlphaZero is still the superior choice to solve games with known rules, especially if you don't have DeepMind's level of computing resources.

If anyone knows better about this, I would be happy to be proven wrong though.