by andy_xor_andrew 921 days ago
interesting how MCTS decoding is called out. that seems entirely like a software aspect, which doesn't depend on a particular chip design?

and on the topic of MCTS decoding, I've heard lots of smart people suggest it, but I've yet to see any serious implementation of it. it seems like such an obviously good way to select tokens, you'd think it would be standard in vLLM, TGI, llama.cpp, etc. But none of them seem to use it. Perhaps people have tried it and it just doesn't work as well as you would think?

2 comments

It’s very difficult to implement, and requires training the network to use it.

I worked at DeepMind on projects that used MCTS. Even with access to the AlphaZero source code, it was very difficult to write another implementation that got the same results as the original.

I'm really curious about this part:

> and requires training the network to use it.

I thought one of the benefits of MCTS was, if you already have your value network, then a general MCTS implementation can walk the tree of values created by that network. And so no special update to the model is necessary. But I'm probably wrong about this.
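
A minimal sketch of that idea, under toy assumptions: the search below takes a frozen `value_fn` as a plain callable and never updates it. The names (`Node`, `search`, `value_fn`), the plain UCT selection rule, and the fixed-length sequence task are all illustrative assumptions, not code from AlphaZero or from any inference library.

```python
import math
from typing import Callable, Dict, List, Tuple

State = Tuple[int, ...]  # a partial sequence of token ids


class Node:
    def __init__(self) -> None:
        self.visits = 0
        self.value_sum = 0.0
        self.children: Dict[int, "Node"] = {}

    def mean(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def search(root_state: State, actions: List[int], max_len: int,
           value_fn: Callable[[State], float],
           simulations: int = 200, c: float = 1.4) -> int:
    """Return the most-visited first action; assumes root_state is not terminal."""
    root = Node()
    for _ in range(simulations):
        node, state, path = root, root_state, [root]
        # Selection: descend with UCT while the node is fully expanded.
        while len(state) < max_len and len(node.children) == len(actions):
            parent_visits = node.visits
            action, node = max(
                node.children.items(),
                key=lambda kv: kv[1].mean()
                + c * math.sqrt(math.log(parent_visits) / kv[1].visits),
            )
            state = state + (action,)
            path.append(node)
        # Expansion: add one untried child unless the state is terminal.
        if len(state) < max_len:
            action = next(a for a in actions if a not in node.children)
            child = Node()
            node.children[action] = child
            node, state = child, state + (action,)
            path.append(node)
        # Evaluation: query the frozen value function instead of a rollout.
        value = value_fn(state)
        # Backup: update statistics along the visited path.
        for visited in path:
            visited.visits += 1
            visited.value_sum += value
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

With a made-up value function such as `lambda s: s.count(2) / 4` and `search((), [0, 1, 2], 4, value_fn)`, the search settles on token 2 as the first action without any training step. Whether an off-the-shelf value network is accurate and fast enough for this to pay off is a separate question.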

(also, it boosts my confidence to hear that even folks at DeepMind find MCTS difficult to implement :D Because I tried to implement a simple MCTS a few years back for a very small toy project. I was following a step-by-step explanation of how it worked, and even still, it was super difficult, and very prone to subtle bugs)

Ah, well you could use a standard value network, but it’d end up really slow, so you probably want to train a smaller one and rely on the implicit ensembling that MCTS does to make it better.

In my experience, PUCT does a lot better than UCT, so you want to also have a prior network.
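
For anyone who hasn't seen the two rules side by side, here is a rough sketch of the selection scores. The function names and the default constants are assumptions for illustration; the PUCT form follows the commonly cited AlphaZero-style formula, where the exploration bonus is scaled by a prior probability for the action.

```python
import math


def uct_score(mean_value: float, parent_visits: int, child_visits: int,
              c: float = 1.4) -> float:
    """UCT: mean value plus a count-based exploration bonus (no prior)."""
    if child_visits == 0:
        return math.inf  # unvisited children are tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def puct_score(mean_value: float, prior: float, parent_visits: int,
               child_visits: int, c_puct: float = 1.5) -> float:
    """PUCT: the exploration bonus is weighted by the prior for this action."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return mean_value + bonus
```

The practical difference: UCT treats every unvisited child as equally urgent, while PUCT ranks unvisited children by their prior. With a large vocabulary that matters a lot, since the search can't afford to try every token once.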

You don’t have to train a new network, but in my experience, it works much better. I haven’t spent a ton of time using off-the-shelf networks with MCTS though. Maybe it works great.

very subtle bugs is the MCTS experience. Particularly once parallelism is involved.

really interesting! thanks for the info!

Doesn't MCTS imply that you'd have to generate a whole tree of tokens? Instead of maybe a 200 token response, you'd have to generate several thousand tokens as you explore the tree?
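
A back-of-the-envelope sketch of that cost, assuming a fresh search for every emitted token, one model evaluation per simulation, an optional fixed-length rollout, and no subtree reuse or batching. The function name and all the numbers are made up for illustration.

```python
def tokens_evaluated(response_tokens: int, simulations_per_token: int,
                     rollout_len: int = 0) -> int:
    """Model evaluations needed when each emitted token gets its own search.

    Each simulation costs one evaluation for the expanded node, plus
    rollout_len more if the search rolls out instead of using a value network.
    """
    per_simulation = 1 + rollout_len
    return response_tokens * simulations_per_token * per_simulation


# 200-token response, 50 simulations per token, value network only
print(tokens_evaluated(200, 50))                  # 10000
# same, but each simulation also rolls out 20 tokens
print(tokens_evaluated(200, 50, rollout_len=20))  # 210000
```

So under these assumptions the answer is yes: the search evaluates thousands to hundreds of thousands of tokens for a 200-token response, depending on the simulation budget and on whether rollouts are used.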