Hacker News new | ask | show | jobs
by code_biologist 1622 days ago
Great answer. It prompts a bunch of followup questions!

most DS/ML scientists use ML models that typically don't provide an easy way to estimate uncertainty intervals

Not an DS/ML scientist but a data engineer. The models I've used have been pretty much "slap it into XGBoost with k-fold CV, call it done" — an easy black box. Is there any model or approach you like to estimate uncertainty with similar ease?

I've seen uncertainty interval / quantile regression done using XGBoost, but it isn't out of the box. I've also been trying to learn some Bayesian modeling, but definitely don't feel handy enough to apply it to random problems needing quick answers at work.

1 comments

Correct, quantile regression is an option. Another is "pure" bootstrapping (you can see by googling something like uncertainty + machine learning + bootstrapping that this is a very active area of current research).

The major problem with bootstrapping is the computational time for big models, since many models need to be fit to obtain a representative distribution of predictions.

Now, if you want more "rigorous" quantification of uncertainty, one option is to go Bayesian using probabilistic programming (PyMC, Stan, TMB), but computational time for large models can be prohibitive. Another option is to "scale down" the complexity to models that might be (on average) a bit less accurate, but provide rigorous uncertainty intervals and good interpretability of results, for example Generalized Additive Models.

A note here is that I saw certain quantification of uncertainty by people who were considered very capable in the ML community that gave me goosebumps, for example since the lower bound of the interval was a negative number and the response variable modeled could not be negative, the uncertainty interval was "cut" at zero (one easy way to deal with it, although it depends on the variable modeled and the model itself, is log-transforming the response—but pay attention to intervals when exp(log(y)) to get back to the natural scale. Another useful interview question.)