The author says not everything should be differentiable. Intuitively, I agree, but the question is how to do a sufficiently fast search through a high-dimensional space when you don't have a gradient.
If you don't have a gradient, one tactic is to make the most of the situation. Give your model the Bayesian treatment, and sample from the posterior using MCMC. This is slow, but you end up with posteriors on your parameter values, which is a huge win.
Yeah, I've been a big fan of probabilistic programming for a while. The real problem is that getting Monte Carlo methods to converge and produce a large sample from the posterior takes orders of magnitude more time than running an optimizer to descend a gradient. Hey, you can even make it a probabilistic gradient: variational inference! But then you still have a hard time with discrete, nondifferentiable structure.
Guided by domain-independent lowerbound functions, obviously. There are lots of research in this area, see every year there are papers on heuristic search in AAAI. ML people are simply not reading them and citing them.
Could you point some out to me? I haven't been deliberately avoiding them by any means, but I'd like to see some that allow fast search. And especially, if you've got it, fast stochastic search for Monte Carlo methods.
I bet you already know MCTS. But I see you have a problem with defining "fast". Define "fast". You can usually only say something is faster or slower.
Absolute speed depends on the machine resource and the low-level optimization. If you take MCTS it is "enough fast" considering it has already beat Lee Sedol.
There are multiple means to measure the speed. Blind search is "fast" re: node expansion but takes exponential time to find a solution and quite likely exhausts memory. Search with heavy heuristic functions has "slow" expansion but can solve the problem quickly, despite the cost of evaluating them, since it can prune a large part of the state space and expands fewer nodes. Only the latter is practically meaningful.
The best search methods allow reversible state updates. The reversibility makes things super-fast -- you no longer need to copy all data structures on path expansion. Instead, you modify only a single representation, incrementally. And when retracting a search step, you "undo" the same modifications again, arriving at the exact same state as when you started.
This is of course non-trivial -- it is much easier to copy everything, then throw away the entire copy when it's not needed, rather than keeping a single state incrementally consistent. But the effects due to data locality (excellent caching), better memory management (no allocations, fragmentation) and less work (only touch and update parts of the state that matter) can be tremendous.
I haven't seen any discussion of DeepMind's implementation details for AlphaGo, but since they come from a game development background (David Silver was the CTO and lead dev at Elixir Studios), where each cycle counts, I have no doubt they're well familiar with all these concepts. But then the TPU throws a wrench into it again...
That idea is nearly 30 yrs old, it is the whole point of IDA* (Korf 85) and maybe Frontier Search (Korf 99) too. Their benefits, drawbacks and the cure are all well studied. I'm surprised some people who discuss AI do not know these famous algorithms.
MCTS's playouts do not need to backtrack (they are just the greedy probes), so it is irrelevant. By backtrack, do not confuse it with the backpropagaton in MCTS.