by therajiv 3261 days ago
The author discusses how linear models are generally more interpretable than deep learning methods, but I'd argue that's actually changing pretty quickly. Especially for large image/sequence inputs (which covers most of the applications that are getting hyped up), linear regressions don't perform very well, and often that performance difference prevents them from picking out important features. Given that fast, scalable methods for feature importance are on the rise (e.g. https://arxiv.org/abs/1704.02685, which the author mentions), you often get equally interpretable feature scores from deep models that are more accurate than analogous ones from linear models.

Basically, my point is that model interpretation strongly depends on how accurate your model is, and because deep learning models are so much better than linear models for some tasks, it makes sense to use them - even if your primary goal is interpretability.

That said, I do believe that if you care at all about interpretation, you should almost never be using multilayer perceptrons (which have recently become part of the widening umbrella term "deep learning"), because they rarely work better than decision tree models or basic linear models (and MLPs are generally at best as interpretable as traditional methods, and often less so).


Feature importance is not quite the same as interpretability.

Random Forests can give feature importance, but that does not account for interactions between features. So, in the end, you don't know how a model made a decision (it could be because there is a feature with high importance, but it could also be because there is an informative interaction between lower importance features).
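A minimal sketch of why a per-feature score can hide the mechanism. This is a toy, not a random forest and not any library's importance method; the two models and the replacement-based score are assumptions made up for illustration. One model is purely additive, the other is a pure interaction (XOR), and both are scored by how much the output changes when a single feature is swapped for an independent value.

```python
from itertools import product

def additive(x1, x2):
    # Each feature contributes on its own.
    return x1 + x2

def interaction(x1, x2):
    # Output depends only on the combination of the two features (XOR).
    return x1 ^ x2

def replacement_importance(model, feature_index):
    """Mean squared change in output when one binary feature is replaced
    by every possible value, averaged over all inputs (hypothetical score)."""
    total = 0.0
    count = 0
    for x1, x2 in product([0, 1], repeat=2):
        for new_value in [0, 1]:
            original = [x1, x2]
            perturbed = list(original)
            perturbed[feature_index] = new_value
            total += (model(*original) - model(*perturbed)) ** 2
            count += 1
    return total / count

scores = {
    name: [replacement_importance(fn, i) for i in range(2)]
    for name, fn in [("additive", additive), ("interaction", interaction)]
}
print(scores)
```

Both toy models end up with identical scores for both features, so the scores alone cannot tell you whether a decision came from one feature or from the combination.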

If you want to compare deep learning with linear models, you should leave image data out of it. Compare them on structured data and bag of words.

MLPs and boosted decision trees, in my experience, definitely beat decision tree and linear models on structured data. But they lack long-term robustness (complex forecasting models need constant retraining, which can hamper their adoption by business units) and don't pass regulatory scrutiny (it is not enough to say "has_asthma" is a high-importance feature).

In finance and health care, interpretability is enormously valued. It is a constant trade-off between accuracy and interpretability.

A long time ago, Caruana and colleagues built hospital triage models, with neural networks being the clear winner in generalization performance. Yet they opted for a simple logistic regression when productionizing. Why?

> [...] patients with pneumonia who have a history of asthma have lower risk of dying from pneumonia than the general population. Needless to say, this rule is counterintuitive. But it reflected a true pattern in the training data: patients with a history of asthma who presented with pneumonia usually were admitted not only to the hospital but directly to the ICU (Intensive Care Unit). The good news is that the aggressive care received by asthmatic pneumonia patients was so effective that it lowered their risk of dying from pneumonia compared to the general population. The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized).

http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

Though there is nothing holding you back from using both simple linear and complex non-linear models at the same time: only when the models severely disagree do you pick the interpretable model. Or use the linear model to find data issues, like those mentioned above, that are tremendously obscured (if not impossible to identify) when only using deep learning in a train-test framework.
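A rough sketch of that routing rule, assuming both models output a probability for the same case. The function name and the `max_gap` threshold are hypothetical; what counts as "severe" disagreement would have to be chosen per application.

```python
def choose_prediction(linear_prob, complex_prob, max_gap=0.2):
    """Return (probability, source). Fall back to the interpretable linear
    model only when the two models disagree by more than max_gap."""
    if abs(linear_prob - complex_prob) > max_gap:
        return linear_prob, "linear"
    return complex_prob, "complex"

# Models roughly agree: keep the more accurate complex model.
agree = choose_prediction(0.30, 0.35)
# Models severely disagree: use the model you can inspect.
disagree = choose_prediction(0.10, 0.60)
print(agree, disagree)
```

Logging the disagreeing cases is also a cheap way to find candidates for the kind of data issue described above.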

>"In the study, the goal was to predict the probability of death (POD) for patients with pneumonia so that high-risk patients could be admitted to the hospital while low-risk patients were treated as outpatients."

So what they wanted to know is POD|"No hospital", but they clearly collected data about POD|"Hospital" (since it included ICU admission, etc.).

The problem is they measured the wrong thing and then misinterpreted their results. Worse, it looks like the study was designed to be this way!

>"The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized)."

The model learned correctly in this scenario. If you go to the hospital for pneumonia it is apparently in your best interest to claim a history of asthma.

The anecdote about pneumonia and the ICU is pretty puzzling. Why wasn't admission to the ICU one of the classification "labels"?
Here is a talk about that paper: https://www.youtube.com/watch?v=UqPcq0n59rQ

I see that it has also gotten mainstream news coverage as some kind of lesson about the dangers of machine learning. The real problem is they didn't have data that could answer the question they had, P(Death|No hospitalization), so instead they fit models to answer a different question, P(Death|Hospitalization).

Then they didn't like that the complex models answered the second question too well, so they used simpler ones that made it easier to manually filter out any results that didn't make sense as answers to the first question (which isn't one they could answer to begin with).

No model they fit is safe. You could only use such a model in domains where P(Death|No hospitalization) ~ P(Death|Hospitalization), which isn't something they assessed.
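The gap between the two conditional probabilities can be shown with simple arithmetic. All numbers below are hypothetical and chosen only for illustration; none come from the paper. The sketch assumes asthmatics have a higher untreated risk but are always sent to the ICU, where care cuts the risk more sharply than on a regular ward.

```python
# Hypothetical P(death | no hospitalization) per group.
UNTREATED_RISK = {"asthma": 0.30, "no_asthma": 0.10}
# Hypothetical factor by which each kind of care scales the risk.
CARE_RISK_MULTIPLIER = {"icu": 0.2, "ward": 0.7}
# Assumed triage policy behind the training data.
CARE_GIVEN = {"asthma": "icu", "no_asthma": "ward"}

def observed_risk(group):
    """P(death | hospitalization) as it would appear in the collected data."""
    return UNTREATED_RISK[group] * CARE_RISK_MULTIPLIER[CARE_GIVEN[group]]

for group in UNTREATED_RISK:
    print(group, "untreated:", UNTREATED_RISK[group],
          "observed:", round(observed_risk(group), 3))
```

Under these assumptions the ordering of the two groups flips between the untreated and the observed risk, so a model fit to the observed data gets the first question backwards while answering the second one correctly.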

But what is the situation in real life? Can I get some feature importance scores, say, from a TensorFlow model?
I'll agree with you that it's much harder than it should be (thankfully, finding the implementations is the hard part, not using them), but yes, these methods do exist.

DeepLIFT (the method I linked in my original comment: https://github.com/kundajelab/deeplift) takes a Keras model (with Theano or TensorFlow backend) as input and provides feature importance scores for any desired layer of the network (raw data inputs, inputs to dense layers following convolution, etc.). Keras-Vis (https://github.com/raghakot/keras-vis) is another nice package that allows for easy visualization of saliency maps and convolutional filters. Perturbing inputs and looking at the effect on the output of the network is another technique people use pretty often.
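The perturbation idea is simple enough to sketch without any framework. This is not the DeepLIFT or Keras-Vis API; `toy_model` is a hypothetical stand-in for a trained network's forward pass, and replacing an input with a fixed baseline of 0.0 is just one assumed choice of perturbation.

```python
def perturbation_scores(model, inputs, baseline=0.0):
    """Score each input by how much the output drops when that input
    is replaced with a baseline value, one input at a time."""
    reference_output = model(inputs)
    scores = []
    for i in range(len(inputs)):
        perturbed = list(inputs)
        perturbed[i] = baseline
        scores.append(reference_output - model(perturbed))
    return scores

def toy_model(x):
    # Hypothetical stand-in for a network: the third input is ignored.
    return 2.0 * x[0] - 1.0 * x[1] + 0.0 * x[2]

scores = perturbation_scores(toy_model, [1.0, 1.0, 1.0])
print(scores)  # [2.0, -1.0, 0.0]
```

With a real network you would pass its prediction function in place of `toy_model`; the cost is one extra forward pass per input, which is why this gets slow for large images or sequences.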

I think there's a lot of room for this space to become easier to use, especially for newer deep learning practitioners. To that point, I definitely agree with the author of this blog post.

Thanks for the links - a friend of mine is working on something like DeepLIFT but I hadn't heard of it...