
	by noirdujour 1254 days ago
	This touches a little on a philosophical difference, but I have found that one major difference is the quantification of uncertainty. Bayesian models, being naturally based around modeling a probability distribution, easily enable the researcher to make claims such as "I am 95% certain the mean lies between x and y, given the data." On the other hand, neural networks, decision trees, and other models more associated with ML than Bayesian statistics do not have this capability built in. (Though there are variations and techniques to build confidence intervals and such with them.)


lalaland1125 1254 days ago

This isn't true. When you fit a neural network, you are almost always fitting a probability distribution (Bernoulli for binary outcomes, normal for numeric ones, etc.), and all of these can provide your probability estimates.

When a model for a binary outcome returns 0.9 for a given data point, that implies a 90% probability that the value is true.

Evaluating the quality of these estimates (often called measuring the calibration of the model) is also very common.

(There are some exceptions, of course. Max margin models aren't probabilistic. And sometimes people use fixed variance parameters for their normal models, etc.)
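
For anyone wondering what "measuring the calibration" looks like in practice, here is a minimal sketch of one common binned measure, often called expected calibration error. The function name, the choice of 10 equal-width bins, and the toy data are all assumptions for illustration, not something taken from a specific library.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for binary predictions.

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : observed outcomes, each 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp the index so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's gap by the share of points it holds.
        ece += (len(bucket) / total) * abs(mean_pred - frac_pos)
    return ece


# Ten predictions of 0.9: calibrated if about 9 of them turn out true.
calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
# Same predictions, but only 6 of 10 turn out true: overconfident.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
print(calibrated, overconfident)
```

The first score is essentially zero, while the second is about 0.3, the gap between the claimed 90% and the observed 60%.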

steppi 1254 days ago

This isn’t what your parent is saying. Many machine learning models are capable of producing calibrated probabilities. What Bayesian models give on top of this is that one doesn’t just predict a probability p, but a posterior distribution for p. This allows for estimating confidence bands around p that quantify one’s uncertainty in the estimate. This is useful for assessing data drift and detecting anomalous examples. One can also get such uncertainty bands from non-Bayesian methods by considering ensembles of models. See the review paper by Abdar et al. [0] for more info.

[0] https://www.sciencedirect.com/science/article/pii/S156625352...

lalaland1125 1254 days ago

A "posterior probability for p" for a Bernoulli distribution is meaningless because it's equivalent to a single Bernoulli distribution.

contravariant 1253 days ago

It's meaningful in the sense that what the model produces is often not the posterior probability.

See my other comment for more detail.

steppi 1254 days ago

If one only had such an estimate for a single example, this would be true, but in aggregate over many predictions, the uncertainty bands can be useful for decision making. This is an active area of research.

contravariant 1253 days ago

Sure, but you do need to take into account that often fitting a neural network consists of finding the maximum likelihood estimate. So from a Bayesian perspective you're ignoring the prior and you risk overfitting by not considering anything but the most likely alternative. Most attempts to avoid overfitting do not really translate well to the Bayesian perspective.

You can actually recover from this a bit. I saw a paper once where they used the Hessian to approximate the posterior as a Gaussian distribution around the maximum likelihood estimate. Can't remember what this paper was called, unfortunately.
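
The Hessian trick described above is usually called the Laplace approximation. Below is a rough one-parameter sketch, not the method from the paper in question. It assumes a coin model with p = sigmoid(theta) and a flat prior on theta, so the posterior mode coincides with the maximum likelihood estimate. The function names and the finite-difference step are made up for illustration.

```python
import math


def neg_log_likelihood(theta, heads, tails):
    """Bernoulli negative log-likelihood with p = sigmoid(theta)."""
    # log(p) = theta - log(1 + e^theta), log(1 - p) = -log(1 + e^theta)
    log_one_plus = math.log1p(math.exp(theta))
    return -(heads * (theta - log_one_plus) - tails * log_one_plus)


def laplace_approximation(heads, tails, eps=1e-3):
    """Gaussian approximation in logit space; needs heads > 0 and tails > 0."""
    theta_hat = math.log(heads / tails)  # closed-form MLE of the logit

    def f(theta):
        return neg_log_likelihood(theta, heads, tails)

    # Central finite difference for the second derivative (the 1-D Hessian).
    hessian = (f(theta_hat + eps) - 2 * f(theta_hat) + f(theta_hat - eps)) / eps**2
    # Gaussian mean is the MLE, variance is the inverse Hessian.
    return theta_hat, 1.0 / hessian


mean_small, var_small = laplace_approximation(2, 2)
mean_large, var_large = laplace_approximation(2000, 2000)
print(mean_small, var_small)
print(mean_large, var_large)
```

Both fits are centered on the same point (logit 0, i.e. p = 0.5), but the variance is roughly 1 for four tosses and roughly 0.001 for four thousand, which is the extra information a bare maximum likelihood estimate throws away.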

datastoat 1253 days ago

Probability estimates are not the same thing as uncertainty.

Consider tossing a coin. If I see 2 heads and 2 tails, I might report "the probability of heads is 50%". If you see 2000 heads and 2000 tails you'd also report the SAME probability estimate -- but you'd be more certain than me.

Neural networks give probability estimates. Bayesian methods (and also frequentist methods) give us probability estimates AND uncertainty.

The literature on neural network calibration seems to me to have missed this distinction.
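
To put numbers on the coin example, here is a small sketch using a Beta posterior. The uniform Beta(1, 1) prior and the function name are assumptions chosen for illustration; other priors would shift the numbers slightly but not the point.

```python
import math


def beta_posterior_summary(heads, tails, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of P(heads) under a Beta prior."""
    a = prior_a + heads
    b = prior_b + tails
    mean = a / (a + b)
    # Standard deviation of a Beta(a, b) distribution.
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


few = beta_posterior_summary(2, 2)         # 2 heads, 2 tails
many = beta_posterior_summary(2000, 2000)  # 2000 heads, 2000 tails
print(few)
print(many)
```

Both observers report a posterior mean of 0.5, but the posterior standard deviation is about 0.19 after 4 tosses and about 0.008 after 4000. The point estimate is identical; only the uncertainty differs.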

dwiel 1253 days ago

It is common for a network to output the distribution, so the output is both the mean and variance instead of just the mean like you pointed out. For example check out variational autoencoders.

datastoat 1253 days ago

In my example, of predicting a coin toss, the naive output is a probability distribution: it's "Prob(heads)=0.5, Prob(tails)=0.5". This is the distribution that will be produced both by the person who sees 2 heads and 2 tails, and by the person who sees 2000 heads and 2000 tails.

Bayesians use the terms 'aleatoric' and 'epistemic' uncertainty. Aleatoric uncertainty is the part of uncertainty that says "I don't know the outcome, and I wouldn't know it even if I knew the exact model parameters", and epistemic uncertainty says "I don't even know the model".

Your example (outputting a mean and variance) is reporting a probability distribution, and it captures aleatoric uncertainty. When Bayesians talk about uncertainty or confidence, they're referring to model uncertainty -- how confident are you about the mean and the variance that you're reporting?
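
One way to make the split concrete, sketched under the assumption that you have several fitted models (an ensemble, or samples from a posterior), each reporting its own mean and variance for the same input. By the law of total variance, the average reported variance is the aleatoric part and the spread of the reported means is the epistemic part. The function name and the numbers below are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(member_predictions):
    """Split predictive variance into (aleatoric, epistemic) parts.

    member_predictions: list of (mean, variance) pairs, one per model.
    """
    means = [m for m, _ in member_predictions]
    variances = [v for _, v in member_predictions]
    aleatoric = fmean(variances)  # noise each model expects in the outcome
    epistemic = pvariance(means)  # disagreement between models about the mean
    return aleatoric, epistemic


# Models agree on the mean: all remaining uncertainty is aleatoric.
agree = decompose_uncertainty([(1.0, 0.5), (1.0, 0.5), (1.0, 0.5)])
# Models disagree on the mean: epistemic uncertainty shows up.
disagree = decompose_uncertainty([(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)])
print(agree, disagree)
```

A single network that outputs one mean and one variance only ever gives you the first number.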

sdl 1253 days ago

See e.g. Ian Osband's work (he calls it 'risk' vs. 'uncertainty') for some good examples that help in differentiating this: https://scholar.google.com/citations?view_op=view_citation&h...

steppi 1253 days ago

The variational autoencoder is a Bayesian model. See [0] for instance.

[0] https://jeffreyling.github.io/2018/01/09/vaes-are-bayesian.h...

dwiel 1253 days ago

Right, the claim was that "Neural networks give probability estimates. Bayesian methods give us probability estimates AND uncertainty" which presents a false dichotomy. I think we agree.

steppi 1253 days ago

Ah yes, got you. It is a false dichotomy because it neglects that there’s such a thing as Bayesian neural networks. Also, taking ensembles of ordinary neural networks with random initializations approximates Bayesian inference in a sense and this is relatively well known I think.

mr_toad 1253 days ago

> The literature on neural network calibration seems to me to have missed this distinction.

I’d hazard a guess that analytical solutions are intractable and numerical solutions would be infeasible.

cjbgkagh 1253 days ago

Depends on the loss function. A softmax final activation into a cross-entropy loss (or KL divergence) gives probability-like predictions. This is a very common setup, but there are many others that don’t have this property. I figure that’s what you mean by ‘almost always’. You can also use variational inference, where you predict a distribution (usually Gaussian, so a sigmoid activation with two values per prediction) and use a Wasserstein loss function; this can be used to get confidence intervals, among other things.
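
For reference, a bare-bones sketch of the softmax plus cross-entropy setup mentioned above, in plain Python with no framework. The function names and the example logits are illustrative assumptions.

```python
import math


def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log of the probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([2.0, 1.0, 0.1], target_index=0)
print(probs, loss)
```

The outputs are non-negative and sum to 1, which is why they read as probabilities. Whether they are well calibrated is a separate question.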

PartiallyTyped 1253 days ago

Neural networks are notorious for being overconfident, and their predictions shouldn’t be taken at face value, i.e., as probabilities.