Hacker News new | ask | show | jobs
by cgreerrun 2190 days ago
I've been working on a Python implementation that uses Gradient Boosted Decision Trees (LightGBM/Treelite) instead of using a neural network for the value/policy models:

https://github.com/cgreer/alpha-zero-boosted

It's mostly to understand how AlphaZero&Friends work. I'm also curious about how well a GBDT could do, and if there are self-play techniques that can accelerate training.

The nice thing about a GBDT is that, unlike when using a NN, you can do thousands of value/policy lookups per second on a single core. So it should be cheaper to scale self-play and run a lot of self-play experiments (assuming the self-play learnings when using the GBDT model transfer to when you use the more-powerful NN in these environments).

If you're curious about accelerating self-play training, check out David Wu's work (https://arxiv.org/pdf/1902.10565.pdf). He's the creator of KataGo. I implemented his "Playout Cap Randomization" technique in my implementation above and, sure enough, it's much more efficient: https://imgur.com/a/epaKtDY. It seems like it's still early days in terms of how efficient self-play training is.

4 comments

This is very interesting! If your experiments work out, I would be interested in adding "Gradient Boosted Decision Trees" support to AlphaZero.jl.

I saw the work on KataGo and implementing "Playout Cap Randomization" is indeed on my TODO list.

I'll be sure to let you know how it goes. I have a few self-play experiments that I'm trying out once I get the testing setup finished.

Do you know if there is a discord channel for all of us AlphaZero nerds that I'm missing? If not, we should make one. It'd be great to bounce ideas off each other.

Thanks! You can easily find my email address from my github. I am not aware of a canonical discord channel for discussing AlphaZero but the Lc0 community has a discord channel on which I had some interesting conversations.
This is really interesting, thanks for sharing!

I've been thinking about extensions to decision tree models that could get the benefits of NNs and it seems like there are a few ideas floating around.

For example; Probabilistic Random Forests have some really interesting properties for noisy datasets, e.g. "The PRF accuracy decreased by less then 5% for a dataset with as many as 45% misclassified objects, compared to a clean dataset." - https://arxiv.org/abs/1811.05994

PRF's might be a natural fit for RL, especially methods using monte carlo tree search.

Speculating here as I'm not adequately familiar with stochastic calculus, but intuitively it seems like probabilistic decision trees could be made differentiable since the hard decision threshold in a tree could be a turned continuous (i.e. every split is logistic regression), which might enable some really interesting applications. I personally dream of being able to cleanly integrate decision trees and tools from NNs in something like pyro for a fully Bayesian model.

A fast, powerful bayesian model seems like it would be a game-changer. PUCT (the heart of the AlphaZero MCTS that decides which action to choose) really seems setup to model the action choices as a multinomial bayesian inference problem (it already updates the action priors with Dir noise).

Thanks for the link! I don't really know anything about the world of probabilistic trees. I'll check it out.

The only bayesian approach to decision trees I'm familiar with is BART (https://projecteuclid.org/download/pdfview_1/euclid.aoas/127...). I haven't used them, but I'm guessing because it uses MCMC to update the params it's not super fast. I've seen them used in causality applications for partial dependency plots where it's convenient to convey the certainty of a variable's effect.

Check out the SoftBART method, I think there are interesting optimizations possible there. XBart also looks promising as an approach.
There appears to be a LogitBoost tree that does what you say, if I understand you correctly.

[1] https://en.wikipedia.org/wiki/LogitBoost

Thanks for the reference, from a quick read it seems LogitBoost is a booster for ensembling models under a logistic loss.

I meant that the splitting point in a node in the decision trees that make up a random forest is itself a random variable that follows a distribution. Because its a smooth function (i.e. probability of splitting is 50% at the point and rises/falls smoothly) it should in principle be differentiable* so that the whole model can be trained by SGD and/or fit into an end to end learning pipeline with convolutional layers etc.

*Where I'm hazy, is how can a smooth probability function be differentiable when sampling. I'm brainstorming in the open here, will do some reading on stochastic neural networks.

Its great when you can indeed iterate without 100s of GPU/hrs.

Are there any papers/comparisons/tradeoffs on when GBDT predictive-power plateaus compared to a NN?

EDIT: with self play you can trade-off a cpu budget for both the GBDT depth, a NN depth, and the roll-out depth - which is super interesting

> Are there any papers/comparisons/tradeoffs on when GBDT predictive-power plateaus compared to a NN?

None specifically that I know of, but I haven't searched.

"Shallow learning" GBDTs can do pretty well on MNIST (https://www.kaggle.com/c/digit-recognizer/discussion/61480), getting 98%+ accuracy compared to the 99%+ of NNs. So I figured if they can handle MNIST, they can probably handle connect 4, and would be useful to explore the self-play training efficiency aspects of AlphaZero (at orders of magnitude less compute time/cost)

> with self play you can trade-off a cpu budget for both the GBDT depth, a NN depth, and the roll-out depth - which is super interesting.

Definitely. It'll be interesting to see if a deeper MCTS search with a less powerful model can do pretty well. I'm still fairly ignorant about the MCTS literature, but I've definitely seen MCTS married to other value/policy models (linear regressions, for e.g.) that used large numbers of playouts years before Alpha Go came out. Those didn't work out, so seems like the DL aspect of Alpha Zero is somewhat essential to be able to learn games as complex as Go.

I am wondering if your idea of using GBDTs in combination with AlphaZero might not be most influential in areas where no neural network architecture is known to provide the right inductive bias for the problem at hand.

I think neural models are pretty unbeatable in many classic RL environments because convolutional neural networks are REALLY good at learning visual representations. In some sense, I suspect that the great success of AlphaGo Zero comes in big part from the fact that it really makes sense to analyze a Go board as a 2D image using convolutional networks: convolutional networks provide the right inductive bias for the problem of learning to play Go.

However, there are tasks where neural network are not as good, such as symbolic manipulation tasks (I am in a good position to know this as I'm doing research in the area of automated theorem proving). I would be very curious to see how your approach fares for those tasks.

> because convolutional networks provide the right inductive bias for the problem of learning to play Go.

Agreed, seems like there will be a set of environments that GBDT won't work on and a CNN will. I'm only mildly familiar with CNNs, but one thing I wanted to try was to add convolutional features to a GBDT some way. I'm sure it's been tried and worked/failed before, but it would be an interesting exercise (at least for me) to learn why/if it works.

It'd be valuable if there was a way to get 95% of the benefits of CNNs (scale/position invariance, etc.) with less compute, even if the accuracy was lower. Right now the GBDT is probably making a ton of redundant split points for connect 4 to re-detect the same position-shifted pattern. There's gotta be a somewhat generalizable way to at least make that more efficient.

> such as symbolic manipulation tasks (I am in a good position to know this as I'm doing research in the area of automated theorem proving). I would be very curious to see how your approach fares for those tasks.

Definitely! I really want to see a non-game application. I think most people think of AlphaZero as "a cool technique to make SOTA board game agents", but to me it excites me because it seems like a very generalizable technique that'll be applied to other important types of problems.

I know as much as Jon Snow about symbolic manipulation, but if you have an example symbolic manipulation problem that can be shoehorned into an environment (state, actions, transitions, rewards) I'd be down to code it up and see how it does.

how good is your AI so far?
I'm trying to answer that right now, actually.

For connect 4, once it's trained a bit it seems to do really well. At 800 MCTS playouts (<.4s), it goes from "I can beat it and sometimes we draw" before training to "I pray for a draw, it almost always beats me" after ~4 hours of training.

Connect 4 is a solved game, so it should be possible to sample the space of the trillions of (position, who should win?, what are the best move(s)?) tuples and compare the answers to your value/policy models to get some kind of objective error. I haven't had time to do that, but having that benchmark is nice to have so you don't have to do a "ladder tournament" against some reference bot(s) like you do for Go where you don't know what ideal play is.

After training it for 10 hours on Quoridor (using my personal laptop), it still can't beat me, but it doesn't seem anywhere close to plateauing. It goes from the agents aimlessly wandering around the board looking for victory row and randomly placing walls, to putting walls that thwart the opponent and navigating to the victory row.

I decided to implement PCR and try out some self-play techniques on Connect Four before I give it another go for Quoridor; a few days of self-play improvements can speedup training 10x. That's where I'm at now...

Once I test a few strategies I was thinking of firing up a c5a24x, 96-core box on AWS and giving it another go. It's ~1-2$/hr at the spot price so I can probably do a lot of damage for 50$ or so.