This might be a dumb question, but let's say that for whatever reason on a specific problem it's much easier to train a neural network that generalizes well than a decision tree. Why not train the network, then build an equivalent decision tree that just tries to reproduce the network's output? When building the tree from the network, overfitting would not be a concern. In fact, you'd want it to overfit.
You could even say that it only needs to approximately reproduce the output with some tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees.
I think usually if a problem is more solvable by a neural network than a decision tree, there is an underlying reason. Neural nets and decision trees work in very different ways.
Take image classification as an example. CNNs can do it by finding nonlinear patterns that exist. A decision tree would have a very tough time doing it because the pixels have a complicated relationship with each other that defines what the image is of.
I think for something like the data transformations we're talking about a Neural Network would be pretty over kill. It looks like this feature in excel works by comparing the data to pre-defined formats, which is probably done by searching all known formats in a somewhat intelligent (not ai, just intelligent) way so that it's fast. Then it can output that type of data in whatever form you want.
Your comment gave me an interesting idea though: What if we put neural networks inside of decision trees?
> This might be a dumb question, but let's say that for whatever reason on a specific problem it's much easier to train a neural network that generalizes well than a decision tree. Why not train the network, then build an equivalent decision tree that just tries to reproduce the network's output? When building the tree from the network, overfitting would not be a concern. In fact, you'd want it to overfit.
You haven't fixed anything here. You've just encoded your training data in a neural net and then presented the same problem to the decision tree learner. Unless you're planning to transform your training data somehow?
I don't think I follow here. The goal of training the network isn't to encode the training data in the model, but rather to build a model that generalizes well. If the neural network has just memorized the training examples, then it overfit and really isn't useful in the real world.
I'm imagining a hypothetical example where generalization is easier to achieve with a neural network than with a decision tree using standard training techniques. Then a tree trained on the network might generalize better than a tree trained straight on the original data, with the additional benefit of being less of a black box than the network.
This solves the problem of interpretability. You can't interpret the weights of a neural network, but you can easily follow along a decision tree and see if it's doing what you want.
Actually that's somewhat less true for big decision trees. But the general point is that you can train interpretable models to mimic the output of uninterpretable black boxes.
The biggest issue is that decision trees only work for data with fixed inputs and outputs. Recurrent NNs work on a time series and possibly even have attention mechanisms.
No, it doesn't make sense. The training data (inputs and NN-predicted outputs) that you're feeding into the DT is at best the same as the training data (inputs and desired outputs) you had originally.
You can generate infinite training data with the NN by feeding in random inputs and seeing what outputs it gives. You can then train whatever model you want on it without concern for overfitting.
But more importantly, the decision tree will model the behavior of the NN, not necessarily the original data. Which is what you want, if your goal is to understand what function the NN has learned.
The point about infinite training data is potentially useful. The other one I still don't agree with. Your goal is only to understand the NN insofar as it models the original data. Any errors the NN is making are not worth learning about. So it would be better to train the understandable method (DT) on the original data.
> When building the tree from the network, overfitting would not be a concern.
True
> tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees
True
However, my guess is you'll wind up doing only one of (a) having more accurate tree model than training it directory (b) improve the understand ability of your model significantly.
Which is why learning algorithms that produce decision trees are far smarter in the long run. Neural nets might eke out other benefits but there's a lot to be said about justifiable/accountable decisions.
Correct me if I'm wrong, but the only type of decision tree that is comparable to a NN in terms of performance is an ensemble of decision trees, and these are equally hard to interpret as NNs.
It's theoretically possible for a neural net to do this; the network just needs to have the explanation as an output. I agree that decision trees would be more reliable and easier to train, but I'm not sure if hardcoding every feature is scalable.
How do you know that the explanation jives with the other outputs, though? It seems like a turtles-all-the-way-down situation, because now I want to see how it was properly introspective of its own decision making.
Also seems like it’s another magnitude of complexity in the neural net to have it not only train and learn on your inputs, but also train and learn on its own training and learning.
Neural networks don't do anything as sophisticated as self-referential introspection. They just fit the outputs you train them with. The training data you provide would have to include the desired explanations.
Consistency is enforced by the dataset, and also by the model. Both outputs would read from the same hidden layer--the one that encodes the desired transformation.
You could even say that it only needs to approximately reproduce the output with some tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees.