Hacker News new | ask | show | jobs
by wodenokoto 4212 days ago
The problem here is very much effectiveness per time unit, as the author writes. It would probably need to be re-implemented in optimized C code in order to truly test it against other optimized engines.

I think the take away here is how relatively easy it is to make a decent AI for a complex game using neural networks.

2 comments

I do not think that is correct. With chess, you face exponential branching that quickly overwhelms your language-dependent speedup.

Is your python code e.g. 10x times slower than C code? With a branching factor of e.g. 10, that would mean you only search to depth 9 ply whilst C would search to 10 ply [a]. You are not losing that much! A ballpark ELO grade would not vary qualitatively between python and C.

[a] With alpha-beta search, maybe it is 8 ply vs 10 ply. Also the branching factor is larger than 10. But the point remains valid.

It's using theano that generates CUDA code. Before reimplementing I would start with profiling the generated code to spot the main computational bottlenecks.
Actually:

> Faster evaluation function: I didn’t use the GPU for playing, only for training.

So indeed it could probably be optimized. But first I would profile as it can be the case that the bottleneck is Matrix Matrix multiplication in numpy which already delegates computation to an optimized third party library (e.g. OpenBLAS). Maybe it's worth using the GPU at play time as well.

"Maybe it's worth using the GPU at play time as well."

The difficulty there is that sending data to and from the GPU is slow. You need to avoid data transfers, which might mean trying to do everything on the GPU. But the SIMD cores on the GPU are likely to perform poorly due to all the branch statements in chess code.

This may be naive, but board state can be encoded very efficiently using FEN[1] - as an example, the starting board state only takes up 57 bytes.

[1]:https://chessprogramming.wikispaces.com/Forsyth-Edwards+Nota...

There is a delay from sending any data to/from the GPU.

Generally you aim to give the GPU a 'large' task (or many small tasks), then ask for the answer(s) in 1 batch and wait while it is transferred across.

If you have many small tasks where each answer is sent back separately, and you need that answer before requesting the next task, then you will be very slow, even if the data sent/received is small. A naive implementation of the chess algorithm here (with alpha-beta pruning (which has many if-then branches)) would be like this :/

There might still be ways to speed up at play time by evaluating batches of candidate points in the same evaluation function call in the GPU.
CUDA is used to learn the evaluation function, at play time it uses a Python implementation of Negamax that uses this evaluation function to estimate how good are the leaf positions.