Hacker News
by tnone 3410 days ago
Nobody seems to be capable of explaining this properly. It's like monad tutorials, they explain what happens while mistakenly thinking they are telling you why it happens. I keep trying to fit this idea into my head and I can't because the information is not given.

- Where did this difference come from? When did it develop?

- What are the basic premises that a Bayesian believes that a frequentist doesn't, and vice versa? Reason it all the way through front and back.

- What does the B/F's model look like? What are the pieces they use, how are they arranged, what are the dependencies, how does causality flow?

- Why are the choices made by one invalid for the other's model? Where do they agree deliberately despite this?

- What are the consequences in the real world? Give me a real example on why this difference matters? "Real" meaning I don't care about dice, I care about engineering and science.

Instead you get some bullshit about fitting a distribution you don't understand to a model you can't see, while relying on understanding the nuances between words like probability and likelihood which is what you are trying to learn in the first place. Plus I swear the numbers agree in 99% of the "examples" given, with some handwaving "but it's different" to excuse it.

Fucking explanations, how do they work? Not in academia.

10 comments

Ooooookay. So, very long story short...

They're two different academic traditions for what constitutes Good Statistics. They're originally rooted in the philosophical dispute over whether to treat probabilities as frequencies of random outcomes ("frequentist") or as degrees of plausibility ("Bayesian").

In actual fact, a well-trained frequentist knows exactly how and when to use Bayes' rule for gambling, and a well-trained Bayesian knows exactly how and when to publish a paper with a p-value.

The really important difference is over how a whole field expresses its consensus or tradition about what constitutes strong evidence or a plausible theory. A Bayesian would like researchers to elicit priors before experiments (which express something like what reviewers' expectations will be about the experiment), and then calculate posterior distributions after experiments. We could thus then trade off "weak" and "strong" experiments against prior beliefs, while also reducing publication bias' pernicious effect on statistical strength -- or so Bayesians claim. Bayesian methods are also usually more computationally intensive and can make use of small sample sizes.

Frequentists had a lot of disagreements with that sort of thing, and so Neyman-Pearson and Fisher and the like developed a whole lot of statistical methods that don't rely on ever treating a probability as a belief. They preferred to differentiate clearly between a frequency of experimental outcomes, and what researchers think. They figured that Bayesian "priors" were subjective, biased, and untrustworthy. Also, quite importantly, their methods involved a lot less rote computation and instead made use of impressively large experimental samples.

Depending on which tradition you were raised in, and which philosophers of science you side with, you can argue until the end of the world about which one's better. My advice? Use whatever your field demands you use to publish, but be Bayesian on the inside.

I'm not a statistician, and have only studied frequentist statistics (I assume that's the standard taught in introductory stats courses in school).

Like the person at the root of this thread, I have struggled with explanations on why Bayesian is so great. The answers that worry me tend to be along the lines of "Well, suppose you want the probability for event X (typically a "one-off" event). Frequentist statistics cannot give you an answer (one-off events have no distribution to speak of). But with Bayesian statistics, I can compute a probability for it!"

Yes, but as someone else has pointed out, what the heck do you mean by "probability"? Frequentist statistics is fairly clear on the definition. The whole argument given above seems like he is happy he has some mechanism to get an answer, with little thought about whether he is asking a meaningful question.

Which is why your comment resonates with me:

>They preferred to differentiate clearly between a frequency of experimental outcomes, and what researchers think. They figured that Bayesian "priors" were subjective, biased, and untrustworthy.

I don't want an answer that's dependent on how the person thought. That definitely comes across as subjective to me.

>I don't want an answer that's dependent on how the person thought. That definitely comes across as subjective to me.

Then I think you'll be somewhat disappointed when you learn more about philosophy of science and the core debates over methodology. The biggest problem is: nothing is purely objective. Everything involves assumptions of some sort, otherwise we run head-on into the Problem of Induction, white ravens, No Free Lunch Theorems (on the more machine-learny side), and other such problems.

>Yes, but as someone else has pointed out, what the heck do you mean by "probability"? Frequentist statistics is fairly clear on the definition. The whole argument given above seems like he is happy he has some mechanism to get an answer, with little thought about whether he is asking a meaningful question.

I don't think frequentist statistics are very clear here at all! A p-value, after all, is a likelihood, which frequentist statisticians insist is not a probability, but which the math clearly says is a conditional probability. So when you get a p<0.05 finding, it never means, "We actually ran this experiment under a control hypothesis N times, for some large N, and fewer than five came out this way." It's a measure of counterfactual outcomes, conditional on an assumption which we pretend to expect to be true. When the p-value is small, we then pretend to be surprised, and pretend to make an interesting inference.

I say "pretend" because an ordinary NHST is mathematically equivalent to a Bayesian credible hypothesis test with a uniform prior over the hypotheses. Performing the frequentist test involves pretending to believe that uniform prior, even though you probably actually set up the experiment in order to obtain a significant p-value.

In the end, the NHST is a chiefly social practice, and the p-value is chiefly social evidence. It's a way of convincing peer reviewers to accept (that is, subjectively believe) that you did a real experiment, when they would otherwise skeptically believe that you made it all up (which, unfortunately, some researchers have been known to do!).

Bayesian methods don't get rid of this subjective, social component to science and make everything "objective", any more than you can do that by hiring Mr. Spock to do your statistics. Bayesian methods drag the subjective, social component of prior elicitation out into the sunlight where everyone involved has to acknowledge it. They also give you numbers that are actually about the experiment you really did, as opposed to measuring your experiment against an infinity of counterfactual experiments you never really performed.

(And also they're easier with small sample sizes, their results are more intuitive to interpret, and generative models are more intuitive to think about than test statistics.)
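For what it's worth, the "counterfactual experiments" reading of a p-value can be made concrete. This is only a sketch under assumptions of my own (a coin-flip experiment, a fair-coin null, 60 heads in 100 flips, and the function name `two_sided_p_value` are all made up for illustration). It adds up the probability of every outcome you did not observe but that would have been at least as extreme, all computed as if the null were true:

```python
from math import comb

def two_sided_p_value(heads: int, flips: int, null_p: float = 0.5) -> float:
    """Probability, under the null, of a count at least as far from the
    expected count as the one observed (exact binomial sum, no sampling)."""
    expected = flips * null_p
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        # Every k here is a hypothetical dataset, not the one we collected.
        if abs(k - expected) >= observed_gap:
            total += comb(flips, k) * null_p**k * (1 - null_p) ** (flips - k)
    return total

p = two_sided_p_value(heads=60, flips=100)
print(f"p-value for 60 heads in 100 flips: {p:.4f}")
```

Nothing in the loop looks at how plausible the null was to begin with; that is the part a prior would supply.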

All that said, I totally have used frequentist statistics (took a very similar class to yours) when called upon to do so. Fighting a philosophy-of-statistics holy war against your higher-ups in the workplace hierarchy is a really bad idea, so however nice Bayesianism or frequentism might sound, sometimes you buckle down and do what ships products and publishes papers.

Your criticism of p-value usage is legitimate. However, this is not core to frequentist statistics.

When I first encountered p-values, even with a frequentist mindset, I saw the huge problem that one could have with them. Many frequentists do not like p-values. I wouldn't be surprised if most actual frequentist statisticians (not those in fields like medicine, psychology, etc) do not like p-value usage.

Attacking p-values is not a valid argument against frequentist statistics.

I'll also add that it seems that many Bayesians are really dying for a number, and because frequentist stats doesn't give it to them, they reach for another tool that will - but with little thought about the validity of the tool. I'm not here to defend frequentist statistics, but just because it doesn't give all the answers, that does not mean that some other tool that does give some answers is correct.

It is just as abusable as p-values. I suppose if a Bayesian says he used Bayesian approaches because it made sense given his problem, that's fine (and in my mind, he is just being a statistician, not a Bayesian). The self-identified Bayesians I always encounter don't fall into that mold. They fall into the category of "Look what I can compute that I could not with frequentist statistics" - but every attempt I make to understand what that number means fails - they cannot explain it either, beyond "this is how I feel".

I'm not really trying to make an argument against frequentist statistics and for Bayesian ones. I'm more trying to point out what each style exposes (by printing it in your papers) or conceals (by leaving it semi-consciously understood from that one class in grad school).
> be Bayesian on the inside

Strongly disagree, tbh. Picking one side or the other in this debate is silly. Don't "be Frequentist" so as to avoid Bayesian model building techniques since you'll end up stuck all the time and don't "be Bayesian" so as to look down upon simple, workable, un-motivated estimation procedures with good performance.

I didn't mean "look down upon ... workable .. procedures with good performance." I meant a more commonsense sort of "private Bayesianism", where you maintain a healthy skepticism of things that have always failed before, and a healthy reliance on things that have always worked before, even when public scientific discourse purports to show you very strong non-Bayesian evidence.

For example, back in my MSc days, I would run a whole lot of metrics on our dataset, and look for patterns. Sometimes I would find a strong, interesting pattern, and go try to tell my advisor about it. He would ask me to double-check my code for bugs, rerun things, and see if the pattern was still there. Often, it wasn't.

My advisor was nobody's Bayesian, a frequentist (and even a user of purely descriptive statistics, oftentimes) through and through.

So to me, "Bayesian on the inside" ends up meaning, "at least Bayesian enough to look for experimental errors." This attitude has helped me a lot in debugging difficult snafus in industry, too.

I see what you're saying now and quite like that. Thanks for clarifying!
The Bayesian believes that probability represents our beliefs about the world. The Frequentist believes that probabilities merely represent the long term frequency counts of events.
>The Bayesian believes that probability represents our beliefs about the world.

But what if our beliefs differ?

On several occasions, while tutoring friends who were taking introductory probability, they'd be posed with a HW problem. They would compute the answer in two different ways, and occasionally get two different answers. Both methods seemed correct to them, but they were not - one was always wrong. I used to argue with them about their reasoning on the incorrect answer, but it didn't help much.

What did help? Just doing the homework problem in real life, with a reasonable number of samples. It could be literally in real life or through a computer simulation. The result would always closely agree with one of the answers.

That's why I like frequentist statistics. It gives me a way to settle the answer outside of my own belief system.
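A sketch of what "just doing the homework problem" can look like. The problem here is a hypothetical one picked for illustration (chance of at least one six in four rolls of a fair die), and the two candidate answers are the kind of pair a student might produce; none of it comes from an actual assignment:

```python
import random

def at_least_one_six(rolls: int, rng: random.Random) -> bool:
    return any(rng.randint(1, 6) == 6 for _ in range(rolls))

def simulate(trials: int = 200_000, rolls: int = 4, seed: int = 0) -> float:
    """Fraction of simulated trials in which at least one six shows up."""
    rng = random.Random(seed)
    hits = sum(at_least_one_six(rolls, rng) for _ in range(trials))
    return hits / trials

candidate_a = 4 * (1 / 6)          # "add up the chances": 0.667
candidate_b = 1 - (5 / 6) ** 4     # complement rule: about 0.518
estimate = simulate()

print(f"simulated: {estimate:.3f}  A: {candidate_a:.3f}  B: {candidate_b:.3f}")
```

The simulated frequency lands next to one candidate and nowhere near the other, which settles the argument without anyone having to win it.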

If you have different beliefs then you'll have different probability distributions. There's nothing wrong with that.

Subject to a few technical requirements (basically absolute continuity of priors), it's a theorem that your posteriors will eventually converge as more evidence is gathered.
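A toy illustration of that convergence, not a proof. The assumptions are mine: a coin-like process, conjugate Beta priors, and data that happen to come out 70% successes at every sample size. Two people start far apart and are fed the same evidence:

```python
def posterior_mean(prior_a: float, prior_b: float, successes: int, failures: int) -> float:
    """Mean of the Beta(prior_a + successes, prior_b + failures) posterior."""
    a = prior_a + successes
    b = prior_b + failures
    return a / (a + b)

skeptic = (1.0, 9.0)    # prior mean 0.1
optimist = (9.0, 1.0)   # prior mean 0.9

gaps = []
for n in (10, 100, 1_000, 10_000):
    successes = (7 * n) // 10          # pretend 70% of trials succeed
    failures = n - successes
    m1 = posterior_mean(*skeptic, successes, failures)
    m2 = posterior_mean(*optimist, successes, failures)
    gaps.append(m2 - m1)
    print(f"n={n:>6}: skeptic={m1:.4f} optimist={m2:.4f} gap={m2 - m1:.4f}")
```

With these particular priors the gap works out to 8 / (10 + n), so it shrinks toward zero as n grows. A prior that puts zero mass on the truth would break this, which is what the absolute continuity condition rules out.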

>That's why I like frequentist statistics. It gives me a way to settle the answer outside of my own belief system.

Can you explain this? To me this makes no sense - as a Bayesian I run simulations too.

Neither of these, though, is wrong. Sometimes you care about beliefs, sometimes you care about the long runs....
Part of the problem is that bayesian vs frequentist is one of those things like MWI vs copenhagen or the oxford comma: a certain group of people read a thing and decide having a stance on X makes them part of an in-group. They then flaunt that opinion despite never actually being a statistician/quantum physicist/grammarian or ever actually running up against a situation where either option matters in their real life.
The most obvious thing that nobody seems to ever explain is that a statistician can use both frequentist and Bayesian methods. In fact, most good ones do. Frequentist methods are generally better for finding the needle in the haystack, while Bayesian methods are generally better at proving that it's actually a needle and not a piece of painted hay.

Re: Where did the difference come from, that's down to different interpretations of probability. The frequentist interpretation says that probability describes the world, whereas the Bayesian interpretation describes our beliefs. Here's another common misconception: You don't need to subscribe to one interpretation to the exclusion of the other. People who use the Copenhagen interpretation of quantum mechanics (a frequentist formulation if ever there was one) will also speak of fractional belief (the definition of Bayesian probability). It is important to be clear about which interpretation you're using at any one time, but you don't need to tie yourself to one interpretation, and it doesn't need to be part of your identity or world-view.

I agree, and suspect many people (outside of the statisticians who have had the time and space to digest the philosophical underpinnings) who claim to love Bayesian methods do so because they have been told it's the right thing to love. There are a lot of hand-wavy explanations out there that tell you what each side believes, but I have yet to see something that truly ELI5s it.
Well, maybe I'm an exception since I am a statistician, but Bayesian techniques allow me to do things which are simply impossible with frequentist tools. To be fair, I use both approaches just about every day.
"I use both approaches just about every day" is the only sane answer here. Different tools for different jobs. What would you think if you met carpenters who described themselves as "Hammerists" or "Sawsallitarians"?
Well, there are philosophical arguments for preferring one over another. On the other hand, I've got to get work done and both tools do the job.

I'd say I'm philosophically Bayesian, but frequentist techniques are often more convenient.

There are situations where one does care about long-run frequencies though, right?

Perhaps something like quality control, where we want a procedure that only rejects 5% of within-spec parts?
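A sketch of that kind of procedure, with every number invented for illustration: a hypothetical 10.0 mm nominal dimension, an in-spec process assumed to be Gaussian with a 0.02 mm spread, and a two-sided cutoff chosen so that good parts get rejected 5% of the time in the long run:

```python
import random
from statistics import NormalDist

TARGET = 10.0      # hypothetical nominal dimension, in mm
SIGMA = 0.02       # hypothetical spread of an in-spec process, in mm
ALPHA = 0.05       # long-run fraction of good parts we accept rejecting

# Two-sided cutoff: reject when |measurement - TARGET| exceeds this.
cutoff = NormalDist().inv_cdf(1 - ALPHA / 2) * SIGMA

def false_reject_rate(parts: int = 100_000, seed: int = 1) -> float:
    """Fraction of simulated in-spec parts that the rule rejects."""
    rng = random.Random(seed)
    rejected = sum(abs(rng.gauss(TARGET, SIGMA) - TARGET) > cutoff for _ in range(parts))
    return rejected / parts

rate = false_reject_rate()
print(f"cutoff = +/-{cutoff:.4f} mm, simulated false reject rate = {rate:.4f}")
```

The guarantee here is about the procedure over many parts, not about any single part, which is exactly the long-run-frequency reading.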

Sure, but there's nothing which would prevent you from addressing that from a Bayesian perspective. In fact, Bayesian particle filtering techniques would probably be a great tool for "on-line" quality assurance.
I'd think they just made it clear how to choose one or the other based on whether I wanted something built and/or destroyed, as opposed to whether I just wanted it cut apart. ;)
Well, if you're looking for a TL;DR to describe all of the differences between Bayesian and Frequentist statistics, while also giving you a history of the theoretical development of each, and you want it to be self contained... you're going to have a bad time.

There are separate theoretical foundations, which can be confusing since both Bayesians and Frequentists use the same mathematics of probability theory. A short explanation of the foundational difference is that Bayesians and Frequentists interpret probability in different ways.

To a Frequentist, a probability is nothing more and nothing less than a long run frequency: the proportion of times you expect an event to occur if a random experiment is conducted many times. This proportion is usually conceived of as a true, but unknown, constant. A good Frequentist thus can't describe "the probability that you have cancer", because you either have cancer, or you do not. If you want to see what kind of constraints this places on frequentist descriptions of real-world phenomena, look up the definition of the frequentist confidence interval.
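To save a lookup, here is a small simulation of what the confidence interval definition promises. It is a sketch under my own assumptions (Gaussian data, known sigma, the textbook z-interval for a mean). The true mean is a fixed constant; the interval is the random thing, and "95%" describes how often the procedure captures that constant across repeated experiments:

```python
import random
from statistics import NormalDist, fmean

TRUE_MEAN = 5.0    # fixed constant, "unknown" to the analyst
SIGMA = 2.0        # assumed known
N = 25             # observations per experiment
Z = NormalDist().inv_cdf(0.975)

def interval(sample):
    """95% z-interval for the mean, given known SIGMA."""
    center = fmean(sample)
    half_width = Z * SIGMA / len(sample) ** 0.5
    return center - half_width, center + half_width

def coverage(experiments: int = 20_000, seed: int = 2) -> float:
    """Fraction of repeated experiments whose interval contains TRUE_MEAN."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(experiments):
        sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
        low, high = interval(sample)
        covered += low <= TRUE_MEAN <= high
    return covered / experiments

cov = coverage()
print(f"fraction of intervals containing the true mean: {cov:.3f}")
```

Any single interval either contains the true mean or it doesn't, so the 95% is a statement about the method and not about the one interval you computed.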

(many) Bayesians trace their probabilistic approach to modeling reality to work done in a decision theoretic context in the early 1900's (https://en.wikipedia.org/wiki/Bayesian_probability#Axiomatic...)

In short, Bayesians claim that:

1. Your beliefs should be describable as probability distributions

2. You should update your beliefs when observing new evidence using Bayes' rule

There are solid theoretical justifications for both of these statements.

To a Bayesian, therefore, it is perfectly sensible to talk about the "probability that you have cancer", because there is uncertainty about the phenomenon.
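As a worked sketch of the two claims above, here is Bayes' rule applied to a screening test. All three inputs (1% prevalence, 90% sensitivity, 9% false positive rate) are invented for illustration, and the second update assumes the two test results are independent given disease status:

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(disease | positive test) via Bayes' rule."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Belief before any test is the (hypothetical) population prevalence.
after_one = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.09)

# A second positive result: yesterday's posterior becomes today's prior.
after_two = posterior(prior=after_one, sensitivity=0.90, false_positive_rate=0.09)

print(f"after one positive test:  {after_one:.3f}")
print(f"after two positive tests: {after_two:.3f}")
```

With these made-up numbers a single positive test only moves the belief to roughly 9%, because false positives from the healthy majority outnumber true positives.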

This discussion is, however, almost completely orthogonal to the "applied" implications of choosing a Bayesian or a Frequentist approach to statistical inference. Some thoughts:

1. Bayesian procedures tend to be more computationally intensive

2. non-degenerate Bayesian prior distributions have the effect of "shrinking" parameter estimates towards some null value, which has benefits in high dimensional problems (see: frequentist Lasso and ridge regression)

3. Bayesian inference makes it easy to think about problems in a conditional fashion (e.g., if I knew what "X" was, I would know how "Y" would behave. If I knew what "Y" was, I would know how "Z" would behave). This makes it quite easy to specify intuitive, yet complex, models of interesting phenomena.

4. There are conceptual advantages to thinking about things as probability distributions.

5. Eliciting prior distributions is hard, but it is also work that any good statistician should be doing (at least informally) regardless of whether they're a frequentist or a Bayesian.
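Point 2 in the list above can be made concrete in the simplest case I can think of. This is a sketch, assuming Gaussian data with known sigma and a zero-centered Gaussian prior on a single mean; the data values and function names are made up. The posterior mean comes out identical to a ridge-penalized estimate with penalty sigma^2 / tau^2:

```python
def posterior_mean(xs, sigma: float, tau: float) -> float:
    """Posterior mean of mu when x_i ~ Normal(mu, sigma^2) and mu ~ Normal(0, tau^2)."""
    n = len(xs)
    return sum(xs) / (n + sigma**2 / tau**2)

def ridge_mean(xs, lam: float) -> float:
    """Minimizer over m of sum((x - m)^2) + lam * m^2."""
    return sum(xs) / (len(xs) + lam)

data = [2.1, 1.9, 2.4, 1.6]          # made-up measurements, sample mean 2.0
plain = sum(data) / len(data)
shrunk = posterior_mean(data, sigma=1.0, tau=1.0)
ridge = ridge_mean(data, lam=1.0)    # lam = sigma^2 / tau^2

print(f"sample mean: {plain:.3f}  posterior mean: {shrunk:.3f}  ridge: {ridge:.3f}")
```

A tighter prior (smaller tau) means a larger penalty and more shrinkage toward zero; a very wide prior recovers the plain sample mean.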

> if you're looking for a TL;DR to describe all of the differences between Bayesian and Frequentist statistics, while also giving you a history of the theoretical development of each, and you want it to be self contained... you're going to have a bad time.

Yes a million times! This problem is mirrored IMO in many domains requiring somewhat complicated math. You end up with an explanation of many layers of concepts flattened into one very hard to grok pancake.

Every response to the question in this thread that I've seen so far is overly wishy washy and philosophical. This question has a simple, concrete answer.

The difference between Bayesians and Frequentists is in the loss function that they attempt to minimize.

Bayesian loss functions assume a constant dataset and sum across one's hypothesis set.

Frequentist loss functions assume a constant hypothesis and sum across possible datasets.

https://en.wikipedia.org/wiki/Loss_function

Really though this is a false dichotomy, as it's perfectly possible to be both a Bayesian and a Frequentist by using a loss function which sums over both one's hypothesis set and across possible datasets.
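A sketch of the two sums, under assumptions chosen only to keep it small: a coin flipped 10 times, three hypothetical candidate biases with a uniform prior, squared error loss, and the sample proportion as the estimator. One function holds the hypothesis fixed and sums over datasets; the other holds the dataset fixed and sums over hypotheses:

```python
from math import comb

N = 10                              # flips per dataset
HYPOTHESES = [0.2, 0.5, 0.8]        # hypothetical candidate coin biases
PRIOR = [1 / 3, 1 / 3, 1 / 3]

def binom(k: int, theta: float) -> float:
    """Probability of k heads in N flips when the bias is theta."""
    return comb(N, k) * theta**k * (1 - theta) ** (N - k)

def estimator(k: int) -> float:
    return k / N                    # plain sample proportion

def frequentist_risk(theta: float) -> float:
    """Hypothesis held fixed; average squared error over all possible datasets."""
    return sum(binom(k, theta) * (estimator(k) - theta) ** 2 for k in range(N + 1))

def bayes_expected_loss(k: int) -> float:
    """Dataset held fixed; average squared error over the hypothesis set."""
    weights = [p * binom(k, t) for p, t in zip(PRIOR, HYPOTHESES)]
    total = sum(weights)
    post = [w / total for w in weights]
    return sum(q * (estimator(k) - t) ** 2 for q, t in zip(post, HYPOTHESES))

print(f"risk at theta=0.5: {frequentist_risk(0.5):.4f}")
print(f"posterior expected loss after 5 heads: {bayes_expected_loss(5):.4f}")
```

Averaging `frequentist_risk` over the prior, or `bayes_expected_loss` over the datasets, gives the combined sum over both that the comment describes.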

More often than not, a good explanation is very simple, if not trivial. The way I see it, the two points of view are exactly that - points of view. They are both valid approximations to what happens in reality, and there are areas in which one of them works better than the other. The whole controversy looks to me ridiculously similar to the one described in Gulliver's Travels where philosophers were engaged in an endless argument about from which side to crack an egg.

From a more technical perspective all this comes down to a simple fact that some consider probabilities within the framework of Information Theory, while others prefer to use a standalone axiomatic foundation.

- Where did this difference come from? When did it develop?

I'm not sure of the specifics but it (a) appears to be a fundamental dichotomy on ways to practice "finding a good model" given statistical mathematical foundations and (b) has been heavily politicized historically.

A lot of statistical historical practice is developing a good general purpose way of finding a good statistical model and proving that it works pretty well under some assumptions. Historically, Bayesian methods were considered taboo (perhaps because we generally lacked the ability to compute them) and so most papers were Frequentist. Very historically (Gauss) Bayesian methods were often used to generate some of the first statistical models used in physics.

- What are the basic premises that a Bayesian believes that a frequentist doesn't, and vice versa? Reason it all the way through front and back.

In basic mathematics both sides share the same beliefs, but in practice they favor different means to construct and evaluate models. See my other answer for many more details, but essentially Frequentists evaluate their models by seeing how much they diverge from reality and Bayesians evaluate them by comparing relative likelihoods of models given what they observe. This leads to wide variations in the means of constructing, elaborating, and talking about models.

- What does the B/F's model look like? What are the pieces they use, how are they arranged, what are the dependencies, how does causality flow?

A Frequentist's model can be literally anything. You might legitimately consider "the minute of the day that the mailman arrived" an estimator for "the expected time when stock A will beat out stock B three months from now" and then you use Frequentist methods to evaluate how your estimator performs. You'll also likely conclude that this estimator is terrible.

A Bayesian's model typically flows from a "generative story" which results in a massively parameterizable model which covers a huge swath of potential realities and then the Bayesian goes looking in that space for the "most probable" model.

Frequentists can use Bayesian methods if they like. Bayesians can evaluate their "most probable" models using Frequentist evaluations if they like. Good statisticians do all of the above.

- Why are the choices made by one invalid for the other's model? Where do they agree deliberately despite this?

We both want to travel from Boston to SF. I fly and get there quickly, you drive and have a great road trip. We both arrive at approximately the same place but our methods and experiences differ. For sufficiently short trips they're even identical.

More to the point, Frequentists and Bayesians disagree about their mechanisms for getting to good models. Really dogmatic Frequentists and Bayesians can disagree about "the meaning of probability" but as far as I'm concerned this has much more to do with decision theoretic policy making and education rather than mathematics.

- What are the consequences in the real world? Give me a real example on why this difference matters? "Real" meaning I don't care about dice, I care about engineering and science.

Let's say you want to model an engineering problem statistically. Frequentist methods will probably end up requiring some leaps of logic and clever tricks to get to the best result but they will also end up with at least a few algorithms you could run on constrained hardware. Bayesian methods will be easier to "plug and chug" in many parts (though they still require a lot of finesse) but the final result will almost invariably require a fast computer.

I'd compare it to integration. One school of thought is that if you're pretty clever you can integrate many things by exploiting their structure to find the antiderivative. Another school of thought is "I can answer most practical questions here through numerical integration at the end of the day, so why bother finding an antiderivative?"

Both work essentially but you face very different challenges on each road and some problems can be much easier for one perspective or the other. If you're really good you have both of these tools in your toolbelt and think carefully about when to pull each one out.

> understanding the nuances between words like probability and likelihood

Go into a lab, do an experiment, call that one trial, measure a number, call that number (the value at this trial of) random variable X.

We might want the average value, expected value, or expectation of X denoted by E[X].

Under meager assumptions, if we take a sequence of independent samples of X, then their average will converge to E[X]; this is the law of large numbers.
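A quick sketch of the law of large numbers in action. The choice of distribution is mine, purely for illustration: X is exponential with rate 0.5, so E[X] = 2, and the sample average is computed for larger and larger numbers of independent trials:

```python
import random

RATE = 0.5                       # assumed rate parameter, so E[X] = 1 / RATE = 2

def sample_average(n: int, seed: int = 3) -> float:
    """Average of n independent draws of X."""
    rng = random.Random(seed)
    return sum(rng.expovariate(RATE) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(f"n={n:>7}: average = {sample_average(n):.4f}")

big_average = sample_average(200_000)
```

The small-n averages can land well away from 2; the large-n ones settle close to it.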

We might be interested in the event, call it A, when X > 1.

We might want the probability of A, that is, P(A) = P(X > 1).

For random variable X, we can define its cumulative distribution: For real number x,

F_X(x) = P(X <= x)

[Here we are using TeX notation where F_X is F with a subscript X.]

Then with calculus and meager assumptions, the probability density of X is

f_X(x) = d/dx F_X(x)

With meager assumptions, calculus and f_X(x) can give us the expectation E[X].

The likelihood of X = x is just f_X(x), that is, the value of the density at x.

For the Gaussian distribution, the maximum likelihood is at the central peak of the density which is also the expectation.

In some approaches to statistical estimation, we have some data and seek the estimate x that maximizes the likelihood of getting the data we actually did get.
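A minimal sketch of maximum likelihood estimation, under assumptions of mine: five made-up measurements, a Gaussian model with known standard deviation, and a brute-force grid search over the unknown mean (here called mu) instead of calculus:

```python
from math import log, pi

SIGMA = 1.0                                   # assumed known
data = [4.8, 5.3, 5.1, 4.6, 5.2]              # made-up measurements

def log_likelihood(mu: float) -> float:
    """Log of the product of Gaussian densities of the data, given mean mu."""
    return sum(-0.5 * log(2 * pi * SIGMA**2) - (x - mu) ** 2 / (2 * SIGMA**2) for x in data)

grid = [4.0 + 0.001 * i for i in range(2001)]     # candidate mu from 4.0 to 6.0
mle = max(grid, key=log_likelihood)
sample_mean = sum(data) / len(data)

print(f"grid-search MLE: {mle:.3f}  sample mean: {sample_mean:.3f}")
```

For this model the maximizer coincides with the sample mean (up to the grid spacing), which is the Gaussian special case mentioned above.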

Given events A and B, we can define the conditional probability of event A given event B by

P(A|B) = P(A and B) / P(B)

So, if we think of events as geometric regions and their probabilities as their areas (actually part of a serious approach), then P(A|B) is the fraction of B that is also A.

Then, since P(A and B) = P(B|A) P(A),

P(A|B) = P(B|A) P(A) / P(B)

is Bayes Rule.

If we do experiments and believe from whatever prior to the experiment that we have a meaningful estimate of P(B) or P(A|B), then maybe we are being Bayesian.

More generally knowing that event B occurred we can regard that as information we have obtained, and what that information says about event A is just P(A|B).

Then, if events A and B are independent, event B gives us no more information about event A and we have

P(A|B) = P(A)

So, if we are interested in event A and its probability P(A) and suddenly are told that event B occurred, then for event A we now want the updated view P(A|B).
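These definitions can be checked by brute force on a small sample space. The sketch below uses two dice only because all 36 outcomes are easy to enumerate; the events A, B, and C are my own picks for illustration. It confirms that P(A|B) = P(A) for an independent pair, that the updated view changes for a dependent pair, and that Bayes' rule in its usual form P(A|C) = P(C|A) P(A) / P(C) gives the same number:

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls

def prob(event) -> Fraction:
    """Probability of an event as the fraction of outcomes where it holds."""
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))

def cond(a, b) -> Fraction:
    """P(A|B) = P(A and B) / P(B)."""
    return prob(lambda o: a(o) and b(o)) / prob(b)

A = lambda o: o[0] % 2 == 0        # first die is even
B = lambda o: o[1] == 6            # second die shows 6 (independent of A)
C = lambda o: o[0] + o[1] >= 10    # total is at least 10 (not independent of A)

print("P(A)   =", prob(A))
print("P(A|B) =", cond(A, B))
print("P(A|C) =", cond(A, C))
print("Bayes  =", cond(C, A) * prob(A) / prob(C))
```

Using `Fraction` keeps everything exact, so the equalities hold with no rounding.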

Using the measure theory foundations of probability and the Radon-Nikodym theorem of measure theory, under meager assumptions we can define for random variables X and Y

E[Y|X]

which is a function, say, f(X), of random variable X and is the best least squares estimate of Y among all functions of X, linear or not.
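A small simulation of that least squares property, under assumptions I picked to make the effect obvious: X takes the values -1, 0, 1 with equal probability and Y = X^2 plus a little Gaussian noise. The conditional mean is estimated by averaging Y within each value of X and compared against the best straight line in X:

```python
import random
from statistics import fmean

rng = random.Random(4)
xs = [rng.choice((-1, 0, 1)) for _ in range(30_000)]
ys = [x * x + rng.gauss(0, 0.1) for x in xs]

# Estimate E[Y|X = v] by averaging Y within each value of X.
cond_mean = {v: fmean(y for x, y in zip(xs, ys) if x == v) for v in (-1, 0, 1)}

# Best straight line a + b*X by ordinary least squares, for comparison.
mx, my = fmean(xs), fmean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

mse_conditional = fmean((y - cond_mean[x]) ** 2 for x, y in zip(xs, ys))
mse_linear = fmean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print(f"MSE using E[Y|X]:       {mse_conditional:.4f}")
print(f"MSE using best line:    {mse_linear:.4f}")
```

The line cannot see the dependence at all here because Y depends on X only through X^2, while the conditional mean is left with just the noise.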

This measure theory approach also lets us define

E[Y|Z]

for an infinite set Z of random variables. This definition is useful, e.g., in the Poisson process where each increment of time to the next arrival is independent of all previous increments, Markov processes, a stochastic process adapted to a history, etc.

You're being downvoted because everything you say applies to frequentist statistics.
It would be nice if more HN users responded with a comment making that point rather than using anonymous downvotes...
I just answered one of the user's questions. The user's question I answered didn't get involved in frequentist versus Bayesian and, thus, neither did my answer.

I started my post by quoting the user's question; that's the question I answered.

I never used the word frequentist and made only minimal use of the word Bayesian. I avoided all political fights.

Note: Possibly of special interest to Bayesians, I touched on E[Y|Z] for an infinite collection of random variables Z. This will be important in conditioning (the core of Bayesian methods) in statistics of stochastic processes.

Also of interest in conditioning, I did mention that if events A and B are independent, then B gives no more information about A because

P(A|B) = P(A)

Anyone working with conditioning needs to know this concept.

More generally, I mentioned the sense in which conditioning gives the best possible non-linear least squares estimate; so, here we begin to see the power of conditioning, of which Bayes Rule is the most elementary case.

Also you may find that my touching on the Radon-Nikodym theorem is a first step to high end versions of Bayesian, e.g., the old idea of sequential testing (A. Wald) and the concepts of stopping times, optimal stopping, the strong Markov property, etc. I wrote out an earlier response, longer, I didn't post, that did go back to sigma algebras, measurability, etc. I did omit measurable selection, sufficient statistics, etc. For such concepts, the Radon-Nikodym theorem and sigma algebras are crucial, and my post may be the only one here that mentioned either.

Also, comparing my response to others, you may find that my response was comparatively clear, precise, understandable, for such a short post without poorly defined or undefined terms, correct, and from a mature view.

By the way, I hold a Ph.D. in applied math from one of the world's best and best known research universities. My research was on stochastic optimal control and passed an oral exam from a Member, US National Academy of Engineering. I've published as sole author peer-reviewed original research in mathematical statistics. Once I did a statistical estimation of expected revenue growth for the BoD of FedEx; my work got two Board representatives from investor General Dynamics to change their mind and stay and, thus, saved FedEx. I've worked in statistical consulting in finance, marketing, etc., including in computing and statistical consulting for the faculty at Georgetown University. My work in statistical power spectral estimation got my company sole source on an important contract from the US Navy. Once I did a Monte Carlo estimation of a statistical estimation I did of the survivability of the US SSBN fleet under a special scenario of global nuclear war limited to sea -- the US Navy was pleased. My work passed review from J. Keilson, a world class expert in statistics.

Maybe, instead of what I wrote, some readers were looking for something else.

Sorry some people were offended.