


	by jonath_laurent 2231 days ago
	Author here: I am happy to answer any question you may have about AlphaZero.jl. :-)

6 comments

master_yoda_1 2230 days ago

I am confused about the FAST part: is it faster than all the other implementations (some of which are in C++), or is it just a Julia implementation that you consider fast? I am asking because if Julia is faster than C++ for ML/DL, I would prefer to use it for production use cases.

jonath_laurent 2230 days ago

This needs clarification indeed. As I explain in the documentation, the aim of AlphaZero.jl is not to compete with hyper-specialized and hyper-optimized implementations such as LC0 or ELF OpenGo. These implementations are written in C++ with custom CUDA kernels and they are optimized for highly distributed computing environments. They are also very complex and therefore pretty inaccessible to students and researchers.

The philosophy of AlphaZero.jl is to provide an implementation of AlphaZero that is simple enough to be widely accessible for students and researchers, while also being sufficiently powerful and fast to enable meaningful experiments on limited computing resources. It has the simplicity of the many existing Python implementations, while being consistently between one and two orders of magnitude faster.

More generally, the AlphaZero algorithm is extremely general and I think it can find applications in many research domains (including automated theorem proving, which is my own research area). I have been surprised to see that, despite the general excitement around AlphaZero, very few people actually tried to build on it. One explanation, I think, is the lack of accessible open-source implementations. I am trying to bridge this gap with AlphaZero.jl.

O_H_E 2230 days ago

Always great to see someone who finds something broken, and fixes it for others to move forward.

Thank you.

dklend122 2230 days ago

In addition to the excellent answer by the author below, I'd like to say that Julia can get within spitting distance of C++ speeds (and even BLAS), and sometimes exceed them. So if a comparable amount of work went into optimizing specific paths through generic code (or even the generic code itself), it could be as fast. Also, one can write CUDA kernels in pure Julia.

master_yoda_1 2230 days ago

Looks like I need to start learning Julia

vishvananda 2231 days ago

Nice work on this! I was behind the implementation at Oracle which you referenced in the tutorial. I still keep tabs on the Lc0 crowd, which seems to be pushing into new ideas. Did you pull anything else from the Leela crowd besides prior temperature? It looks like maybe you also tried a WLD output head?
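For readers unfamiliar with the term, here is a minimal sketch of a prior temperature, assuming the common formulation in which each prior probability is raised to the power 1/T and the result is renormalized. The function name `apply_prior_temperature` and the example numbers are hypothetical; this is not code from AlphaZero.jl or Lc0.

```python
def apply_prior_temperature(priors, temperature):
    """Rescale a prior distribution as p_i ** (1 / T), then renormalize.

    Assumed formulation: T > 1 flattens the prior (more exploration),
    T < 1 sharpens it, and T == 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in priors]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical network prior over three moves.
priors = [0.7, 0.2, 0.1]
flat = apply_prior_temperature(priors, 2.0)   # flatter than the input
sharp = apply_prior_temperature(priors, 0.5)  # sharper than the input
same = apply_prior_temperature(priors, 1.0)   # unchanged
```

A flatter prior gives low-probability moves a larger exploration bonus during search, which is the usual motivation for the parameter.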

jonath_laurent 2230 days ago

What do you mean by WLD output head?

So far, the main idea I have pulled from the Lc0 crowd is indeed to have a prior temperature. The next thing I am planning to add is the possibility to batch inference requests across game simulations instead of relying on asynchronous MCTS. In your blog series, you anticipate the problem of the virtual loss introducing some exploration bias in the search but ultimately conclude that it does not change much:

[Citation from your blog series]: "Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use."

Interestingly, it seems that the LC0 team had a different experience here. I myself ran some tests and going from 32 to 4 workers (for 600 MCTS simulations per turn) on my connect-four agent results in a significant increase in performance. This may be due to the fact that I use a much smaller neural network than yours, which is ultimately not as strong.
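To make the exploration bias concrete, here is a minimal sketch under stated assumptions: a single root node, values in [-1, 1], a standard PUCT score, and each in-flight simulation counted as one extra visit that returned a loss. The names `Edge`, `puct_score`, `dispatch` and all numbers are hypothetical and are not taken from AlphaZero.jl, Lc0, or the Oracle implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float
    visits: int = 0
    value_sum: float = 0.0
    virtual_losses: int = 0  # simulations currently in flight through this edge


def puct_score(edge, parent_visits, c_puct=1.0):
    # Assumption: every in-flight simulation counts as a visit with value -1.
    n = edge.visits + edge.virtual_losses
    w = edge.value_sum - edge.virtual_losses
    q = w / n if n > 0 else 0.0
    u = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + n)
    return q + u


def dispatch(edges, num_workers):
    """Choose one root edge per worker before any evaluation returns."""
    chosen = []
    for _ in range(num_workers):
        parent = max(1, sum(e.visits + e.virtual_losses for e in edges))
        best = max(range(len(edges)),
                   key=lambda i: puct_score(edges[i], parent))
        edges[best].virtual_losses += 1  # discourage the next worker
        chosen.append(best)
    for i in chosen:  # remove the virtual losses again
        edges[i].virtual_losses -= 1
    return chosen


def make_edges():
    # Hypothetical statistics: move 0 is clearly the most promising.
    return [Edge(0.6, 50, 30.0), Edge(0.3, 10, 2.0), Edge(0.1, 5, -1.0)]


few = dispatch(make_edges(), 4)    # few workers in flight
many = dispatch(make_edges(), 32)  # many workers in flight
```

In this toy setting, the 4 workers all follow the move that plain PUCT would pick, whereas with 32 workers the accumulated virtual losses push some of them onto moves the search would not otherwise select at that point. Whether this helps or hurts in practice is an empirical question, as the differing reports above suggest.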

Related to this, there is a question I have wanted to ask you since I found your blog article series: did you run experiments with smaller networks and what were the results? What is the smallest architecture you tried and how did it perform?

vishvananda 2230 days ago

The lc0 group has switched the result prediction to predict win, loss, and draw probabilities instead of just win/loss. Some information can be found in https://lczero.org/blog/2020/04/wdl-head/
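As a rough illustration of the idea (not Lc0's actual code), a three-way head can be sketched as a softmax over three logits, from which a scalar value is recovered as P(win) - P(loss). The logit ordering (win, draw, loss) and the function names below are assumptions made for this example.

```python
import math


def wdl_from_logits(logits):
    """Softmax over three logits, assumed to be ordered (win, draw, loss)."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return tuple(e / total for e in exps)


def expected_value(win, draw, loss):
    # Scalar value in [-1, 1], scoring a win as +1, a draw as 0, a loss as -1.
    return win - loss


# Hypothetical logits for a balanced, drawish position.
win, draw, loss = wdl_from_logits([1.0, 2.0, 1.0])
q = expected_value(win, draw, loss)

# Two very different positions that a single scalar value cannot tell apart:
sharp_q = expected_value(0.5, 0.0, 0.5)  # decisive either way
dead_q = expected_value(0.0, 1.0, 0.0)   # certain draw
```

Both `sharp_q` and `dead_q` equal zero, which shows the extra information the three-way output carries compared with a single value.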

vishvananda 2230 days ago

We did a lot of our early experimentation with small networks. I don't think we went any smaller than 5 layers of 64 filters as we mentioned here: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5...

jonath_laurent 2230 days ago

And what were the results of these experiments? What error rate can you reach with the smallest network architecture you tried for example?

vishvananda 2230 days ago

Unfortunately I don't remember the exact numbers, but I think it was a couple percentage points worse than we were able to get with the large models.

jonath_laurent 2230 days ago

This is interesting, thanks! Is there anything else you can tell me about the results of your experiments with small networks? I am really interested in this.

For example: did you notice that increasing or decreasing network size required significant changes in other hyperparameters? Are small networks learning faster at the beginning of training before they start to plateau?

MaxBarraclough 2231 days ago

Hi, thanks for this great project.

Connect Four was used as a demonstration. I presume this is because it's much easier/cheaper to train a Connect Four AI, compared to Go?

jonath_laurent 2231 days ago

Yes. Go 19x19 would be completely intractable on a single machine (one comment is citing a $25 million cost estimate in computing power to train AlphaGo Zero). A more reasonable target would be Go 9x9 but even this would be an extreme challenge on a single machine.

There is an Oracle blog article series about training a close-to-perfect Connect Four player using AlphaZero. Even here, they had to rely on multiple GPUs.

You have to keep in mind that AlphaZero is an extremely sample-inefficient learning technique, even for simple problems. Rather, the strengths of this algorithm are that 1) it is pretty generic and 2) it can leverage huge amounts of computation.

patagurbon 2231 days ago

Do you have any thoughts about multi GPU training? I haven't seen many options for Flux previously, but didn't dig very much.

jonath_laurent 2231 days ago

Multi-GPU support definitely belongs on the TODO list. However, I am currently limited by the state of CUDA.jl on this, as it does not have a device-aware memory pool yet.

I am also looking forward to CUDA.jl supporting f16 and int8 computations, which may enable another big speedup.

doublesCs 2231 days ago

No questions. Just wanted to thank you for sharing. People like you make the world better one tiny bit at a time.

jonath_laurent 2231 days ago

Thanks for your kind message.

dandanua 2230 days ago

This is great! But what are your thoughts about MuZero? :)

jonath_laurent 2229 days ago

I think that MuZero is a fascinating algorithm, but that a lot of news articles are misleading when they present it as a new, superior substitute for AlphaZero.

MuZero is solving a harder problem, in which the learning agent does not have a model of the environment from the start (e.g. it does not know the rules of the game a priori). This makes it potentially applicable to a larger number of real-world challenges.

However, I haven't seen any evidence that it is any better than AlphaZero at learning games such as Chess or Go. Although DeepMind reports that their MuZero agent "slightly exceeds the performances of AlphaZero on Go", they say nothing about the training time and tuning effort spent on each.

As far as I understand and in the absence of further data, I think AlphaZero is still the superior choice to solve games with known rules, especially if you don't have DeepMind's level of computing resources.

If anyone knows better about this, I would be happy to be proven wrong though.