by daddyo 3260 days ago
Feature importance is not quite the same as interpretability.

Random Forests can give feature importance, but that does not account for interactions between features. So, in the end, you don't know how a model made a decision (it could be because there is a feature with high importance, but it could also be because there is an informative interaction between lower importance features).
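To make the interaction point concrete, here is a minimal sketch on a toy XOR dataset (the dataset and the rule set are my own illustration, not something from a real model). Each feature on its own predicts no better than a coin flip, so any per-feature score would rank both as uninformative, yet the two together determine the label exactly.

```python
from itertools import product

# All four combinations of two binary features; the label is their XOR,
# so the signal lives entirely in the interaction between them.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

def accuracy(predict):
    """Fraction of rows where predict(a, b) matches the label."""
    return sum(predict(a, b) == y for a, b, y in rows) / len(rows)

# Best rule that looks at one feature only (the feature or its negation).
best_single = max(
    accuracy(rule)
    for rule in (
        lambda a, b: a,
        lambda a, b: 1 - a,
        lambda a, b: b,
        lambda a, b: 1 - b,
    )
)

# Rule that uses both features together.
interaction = accuracy(lambda a, b: a ^ b)

print(best_single, interaction)  # 0.5 1.0
```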

If you want to compare deep learning with linear models, you should leave image data out of it. Compare them on structured data and bag of words.

MLPs and boosted decision trees, in my experience, definitely beat single decision trees and linear models on structured data. But they lack long-term robustness (complex forecasting models need constant retraining, which can hamper their adoption by business units) and don't pass regulatory review (it is not enough to say "has_asthma" is a high-importance feature).

In finance and health care, interpretability is enormously valued. It is a constant trade-off between accuracy and interpretability.

A long time ago, Caruana and colleagues built hospital triage models, and neural networks were the clear winner in generalization performance. Yet they opted for a simple logistic regression when productionizing. Why?

> [...] patients with pneumonia who have a history of asthma have lower risk of dying from pneumonia than the general population. Needless to say, this rule is counterintuitive. But it reflected a true pattern in the training data: patients with a history of asthma who presented with pneumonia usually were admitted not only to the hospital but directly to the ICU (Intensive Care Unit). The good news is that the aggressive care received by asthmatic pneumonia patients was so effective that it lowered their risk of dying from pneumonia compared to the general population. The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized).

http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

There is nothing holding you back, though, from using simple linear and complex non-linear models at the same time: only when the models severely disagree do you pick the interpretable model. Or use the linear model to find data issues, like those mentioned above, that are heavily obscured (if not impossible to identify) when only using deep learning in a train-test framework.
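A rough sketch of that fallback rule, assuming both models output a probability for the same case. The function name `choose_prediction` and the `max_gap` threshold of 0.2 are hypothetical; what counts as "severe" disagreement would have to be set per application.

```python
def choose_prediction(p_interpretable, p_complex, max_gap=0.2):
    """Return (probability, source) for one case.

    Uses the complex model by default and falls back to the interpretable
    model only when the two predictions differ by more than max_gap.
    max_gap is an illustrative threshold, not a recommended value.
    """
    if abs(p_interpretable - p_complex) > max_gap:
        return p_interpretable, "interpretable"
    return p_complex, "complex"

print(choose_prediction(0.30, 0.35))  # models roughly agree
print(choose_prediction(0.60, 0.05))  # models severely disagree
```

The cases that trigger the fallback are also worth logging, since they are the ones most likely to point at a data issue like the asthma one.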

3 comments

>"In the study, the goal was to predict the probability of death (POD) for patients with pneumonia so that high-risk patients could be admitted to the hospital while low-risk patients were treated as outpatients."

So what they wanted to know is POD | "No hospital", but they clearly collected data about POD | "Hospital" (since it included ICU admission, etc.).

The problem is they measured the wrong thing and then misinterpreted their results. Worse, it looks like the study was designed to be this way!

>"The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized)."

The model learned correctly in this scenario. If you go to the hospital for pneumonia it is apparently in your best interest to claim a history of asthma.

The anecdote about pneumonia and the ICU is pretty puzzling. Why wasn't admission to the ICU one of the classification "labels"?
Here is a talk about that paper: https://www.youtube.com/watch?v=UqPcq0n59rQ

I see that it has also gotten mainstream news coverage as some kind of lesson about the dangers of machine learning. The real problem is they didn't have data that could answer the question they had, P(Death|No hospitalization), so instead they fit models to answer a different question, P(Death|Hospitalization).

Then they didn't like that the complex models answered the second question too well, so they used simpler ones that made it easier to manually filter out any results that didn't make sense as answers to the first question (which isn't one they could answer to begin with).

No model they fit is safe. You could only use one if it were limited to domains where P(Death|No hospitalization) ~ P(Death|Hospitalization), which isn't something they assessed.
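A toy illustration of the gap between the two questions. Every number below is invented for the example and none comes from the paper; the only point is that the asthma effect can flip sign between P(Death|Hospitalization) and P(Death|No hospitalization).

```python
# Hypothetical death rates, made up for illustration (not from the paper).
# Keys: (has_asthma, care received) -> probability of death
death_rate = {
    (True, "icu"): 0.05,    # asthmatics were sent straight to the ICU
    (False, "ward"): 0.10,  # other admitted patients got standard care
    (True, "none"): 0.30,   # counterfactual: asthmatic sent home
    (False, "none"): 0.12,  # counterfactual: non-asthmatic sent home
}

# What the historical data can show: P(Death | Hospitalization), by asthma.
observed_asthma = death_rate[(True, "icu")]
observed_no_asthma = death_rate[(False, "ward")]

# What the triage decision needs: P(Death | No hospitalization), by asthma.
untreated_asthma = death_rate[(True, "none")]
untreated_no_asthma = death_rate[(False, "none")]

print("observed: asthma looks protective ->", observed_asthma < observed_no_asthma)
print("untreated: asthma is a risk factor ->", untreated_asthma > untreated_no_asthma)
```

A model fit only to the first two rows has no way to recover the last two, whichever model family is used.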