Hacker News new | ask | show | jobs
by randomsearch 3326 days ago
> This might be a dumb question, but let's say that for whatever reason on a specific problem it's much easier to train a neural network that generalizes well than a decision tree. Why not train the network, then build an equivalent decision tree that just tries to reproduce the network's output? When building the tree from the network, overfitting would not be a concern. In fact, you'd want it to overfit.

You haven't fixed anything here. You've just encoded your training data in a neural net and then presented the same problem to the decision tree learner. Unless you're planning to transform your training data somehow?

2 comments

I don't think I follow here. The goal of training the network isn't to encode the training data in the model, but rather to build a model that generalizes well. If the neural network has just memorized the training examples, then it overfit and really isn't useful in the real world.

I'm imagining a hypothetical example where generalization is easier to achieve with a neural network than with a decision tree using standard training techniques. Then a tree trained on the network might generalize better than a tree trained straight on the original data, with the additional benefit of being less of a black box than the network.

This solves the problem of interpretability. You can't interpret the weights of a neural network, but you can easily follow along a decision tree and see if it's doing what you want.

Actually that's somewhat less true for big decision trees. But the general point is that you can train interpretable models to mimic the output of uninterpretable black boxes.

The biggest issue is that decision trees only work for data with fixed inputs and outputs. Recurrent NNs work on a time series and possibly even have attention mechanisms.

No, it doesn't make sense. The training data (inputs and NN-predicted outputs) that you're feeding into the DT is at best the same as the training data (inputs and desired outputs) you had originally.
You can generate infinite training data with the NN by feeding in random inputs and seeing what outputs it gives. You can then train whatever model you want on it without concern for overfitting.

But more importantly, the decision tree will model the behavior of the NN, not necessarily the original data. Which is what you want, if your goal is to understand what function the NN has learned.

The point about infinite training data is potentially useful. The other one I still don't agree with. Your goal is only to understand the NN insofar as it models the original data. Any errors the NN is making are not worth learning about. So it would be better to train the understandable method (DT) on the original data.
>Any errors the NN is making are not worth learning about.

But that's the whole point of this method! To understand what errors the NN might be making. It's also quite possible the NN's errors aren't really errors, if there are mistakes or noise in the labels.

This technique has been called "dark knowledge" and is really interesting. See http://www.kdnuggets.com/2015/05/dark-knowledge-neural-netwo... They train much simpler models to get the same accuracy as much bigger models, just by copying the predictions of the bigger model on the same data. In fact you can get crazy results like this:

>When they omitted all examples of the digit 3 during the transfer training, the distilled net gets 98.6% of the test 3s correct even though 3 is a mythical digit it has never seen.

Ah, very interesting! I agree that would be useful. But I think this thread has ended up with a proposal very different from the one I started replying to.