Come into Bayesian land, the water is fine. The whole NHST edifice starts to seem really shaky once you stop and wonder if "True" and "False" are really the only two possible states of a scientific hypothesis. Andrew Gelman has written about this in many places, e.g. http://www.stat.columbia.edu/~gelman/research/published/aban....
> The whole NHST edifice starts to seem really shaky once you stop and wonder if "True" and "False" are really the only two possible states of a scientific hypothesis.
The root problem here is that people tend to dichotomise what are fundamentally continuous hypothesis spaces. The correct question is not "is drug A better than drug B?", it's "how much better or worse is drug A compared to drug B?". And this is an error you can make in both Bayesian and frequentist land, though culturally the Bayesians have a tendency to work directly with the underlying, continuous hypothesis space.
That said, there are sometimes external reasons why you have to dichotomise your hypothesis space. E.g. ethical reasons in medicine, since otherwise you can easily end up concluding that you should give half your patients drug A and the other half drug B, to minimise volatility of outcomes (this situation would occur when you're very uncertain which drug is better).
Gelman et al.'s BDA3 has a fun example in one of the early chapters, estimating kidney-cancer death rates across US counties, that demonstrates this issue with effect sizes. The raw (frequentist) rate estimates make the counties with the smallest populations show up with the most extreme rates, simply because a handful of cases in a small base population swings the rate wildly. A Bayesian model does not have the same issue, because the prior pulls the estimates for small counties toward the overall rate while leaving the estimates for large counties almost unchanged.
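The small-population effect described above can be sketched with a Gamma-Poisson model. This is only an illustration: the area names, counts, and the prior strength `prior_population` are made-up assumptions, and setting the prior mean to the pooled rate is a simple empirical-Bayes shortcut, not necessarily what BDA3 does.

```python
# Hypothetical (cases, population) per area; the numbers are illustrative only.
areas = {"tiny": (2, 1_000), "small": (5, 10_000), "large": (480, 1_000_000)}

total_cases = sum(cases for cases, _ in areas.values())
total_pop = sum(pop for _, pop in areas.values())
pooled_rate = total_cases / total_pop

# Assumed Gamma(alpha, beta) prior on the rate with mean alpha / beta.
# beta behaves like a "prior population"; 20,000 is an arbitrary choice.
prior_population = 20_000
alpha = pooled_rate * prior_population
beta = prior_population


def raw_rate(cases, pop):
    """Plain observed rate: cases divided by population."""
    return cases / pop


def posterior_mean_rate(cases, pop):
    """Poisson likelihood + Gamma prior gives Gamma(alpha + cases, beta + pop)."""
    return (alpha + cases) / (beta + pop)


for name, (cases, pop) in areas.items():
    print(name, raw_rate(cases, pop), posterior_mean_rate(cases, pop))
```

The posterior mean is a weighted average of the raw rate and the prior mean, with weight `pop / (beta + pop)` on the raw rate, so the tiny area is pulled strongly toward the pooled rate and the large area barely moves.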
It's interesting that high p-values actually seem to more conclusively state something than low p values (like p < 0.05) do.
With a high p value, you can say with some degree of certainty that your test was unable to detect any effect, whether that was due to the lack of an effect or because your test wasn't capable of measuring it.
With a low p value, you don't actually really know if you detected something interesting. It could be due to a flawed test, biases, non-causal correlations, faulty p-hacky stats, etc.
So why do we consider the latter more worthwhile when it seems to say less?
Bayesianism makes the problem much worse. Prior-hacking is easier and harder to detect than p-hacking, and Bayesianism has no way to exclude noise results at all. I'm constantly baffled when people suggest it as a solution to these problems.
> Prior-hacking is easier and harder to detect than p-hacking
But that's comparing apples to oranges. Setting a reasonable prior is akin to frequentists interpreting the effect size (including its confidence interval) in light of deep domain knowledge. To produce a good analysis using either Bayesian or frequentist methodology (or to criticise such an analysis), you have to have deep domain knowledge. There's no getting around that, and arguably the use of p-values often lets you get away with shoddy domain knowledge.
> and Bayesianism has no way to exclude noise results at all.
This statement doesn't make any sense. Bayesian methodology has plenty of mechanisms for working with and controlling noisy data (obviously, since it's one of the two key paradigms in statistics, which as a field fundamentally deals with noisy data). The precise error rates and uncertainties that are calculated are usually different from what you would use in a frequentist analysis, but most people consider this a benefit of Bayesian analysis.
> To produce a good analysis using either Bayesian or frequentist methodology (or to criticise such an analysis), you have to have deep domain knowledge. There's no getting around that, and arguably the use of p-values often lets you get away with shoddy domain knowledge.
The whole problem we're facing is that it requires too much domain knowledge and detailed analysis to dismiss results that are actually just noise. The whole point of p-values is that they give you a way to do that without needing that complex analysis with deep domain knowledge - they're not a replacement for doing in-depth analysis, they're a way to cull the worst of the chaff before you do, the statistical-analysis equivalent of FizzBuzz. Bayesianism has no substitute for that (you can't say anything until you've defined your prior, which requires deep domain knowledge), and as such makes the problem much worse.
> (you can't say anything until you've defined your prior, which requires deep domain knowledge)
Well, you can use a non-informative prior. And that's the correct choice when you genuinely don't have a better option. But you should always be able to justify that, and that in turn requires deep domain knowledge....which leads me to....
> The whole problem we're facing is that it requires too much domain knowledge and detailed analysis to dismiss results that are actually just noise.
....this is in no way a "problem" that needs fixing, by allowing shortcuts that can easily be hacked. Rather, it's a factual statement about the difficulty of drawing correct conclusions, in low Signal-to-Noise-Ratio domains. Whether you use p-values or not, and whether you use Bayesian methodology or not, you cannot get around the need to understand the data you're working with. Bad p-values are worse than none, since you have no knowledge of what error rate they actually achieve in the long-run.
> Bayesianism has no substitute for that
Yes it does. It's called Bayes factors. But as I said above, I completely disagree with your view of what a p-value is for.
> Well, you can use a non-informative prior. And that's the correct choice when you genuinely don't have a better option.
At which point you've just found a more cumbersome way to do frequentist statistics. Frequentist tools aren't inconsistent with Bayes' law (they can't be, since both are valid theorems) - indeed one could say that the whole project of frequentist statistics consists of building a well-understood suite of pre-baked priors and computations that are appropriate to situations that are commonly encountered.
> ....this is in no way a "problem" that needs fixing, by allowing shortcuts that can easily be hacked. Rather, it's a factual statement about the difficulty of drawing correct conclusions, in low Signal-to-Noise-Ratio domains. Whether you use p-values or not, and whether you use Bayesian methodology or not, you cannot get around the need to understand the data you're working with.
Well, the fact is there are too many small-sample studies being produced for all or even most of them to be critically analysed by people with deep understanding. And maybe the right fix for the problem is to give the right incentives for that kind of critical analysis (e.g. by allowing that kind of analysis to count as research for the purposes of journal publications and PhD theses just as much as "the original study" does, given that a study without that kind of critical analysis cannot truly be said to represent advancing human knowledge). But if you just tell people to do Bayesian analysis instead of frequentist analysis then that's not going to magically create deep understanding - rather people will try to replace shallow frequentist analysis with shallow Bayesian analysis, and shallow Bayesian analysis is a lot less effective and more hackable.
> Yes it does. It's called Bayes factors.
But you still need a prior to compute a Bayes factor.
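To make the dependence on the prior concrete, here is a minimal sketch of a Bayes factor for a binomial proportion. The hypotheses (H0: p = 0.5 against H1: p ~ Beta(a, b)), the counts borrowed from the ball example elsewhere in the thread, and the two priors compared are all assumptions chosen for illustration.

```python
from math import lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor_10(k, n, a, b):
    """Log Bayes factor of H1 (p ~ Beta(a, b)) over H0 (p = 0.5).

    The binomial coefficient appears in both marginal likelihoods
    and cancels, so it is left out.
    """
    log_m1 = log_beta(k + a, n - k + b) - log_beta(a, b)
    log_m0 = n * log(0.5)
    return log_m1 - log_m0


k, n = 637, 1037  # successes and trials, illustrative
log_bf_flat = log_bayes_factor_10(k, n, 1, 1)            # uniform prior under H1
log_bf_tight = log_bayes_factor_10(k, n, 500, 500)       # prior concentrated near 0.5
print(log_bf_flat, log_bf_tight)
```

Same data, same null, but the two values differ by many orders of magnitude on the natural scale, because the marginal likelihood under H1 averages over whatever prior you put on p.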
> At which point you've just found a more cumbersome way to do frequentist statistics.
Hmm, in one way, yes... but on the other hand, Bayesian posteriors are a lot more intuitive for most people to interpret, so I think you trade one form of convenience for another. But as you sort of hint at, the results should usually be fairly similar whether you're doing frequentist or Bayesian analysis, so in most cases I doubt it matters that much. Where it does matter is when you have grounds for strong priors that you want to take advantage of. In such cases a Bayesian analysis can improve your chances of being correct in the "here and now", whereas a frequentist analysis is only concerned with the asymptotic error rates. (But of course frequentist vs Bayesian is also a spectrum, rather than a black-and-white distinction.)
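A small sketch of the "fairly similar results" point, assuming a binomial proportion with a Beta prior: under a flat Beta(1, 1) prior the posterior mode is exactly the maximum-likelihood estimate, and only an informative prior moves it. The counts and the Beta(50, 50) prior are illustrative assumptions.

```python
def mle(k, n):
    """Frequentist maximum-likelihood estimate of a proportion."""
    return k / n


def posterior_mode(k, n, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior.

    Valid when both posterior parameters exceed 1.
    """
    return (a + k - 1) / (a + b + n - 2)


k, n = 637, 1037  # illustrative counts
flat = posterior_mode(k, n, 1, 1)            # flat prior
informative = posterior_mode(k, n, 50, 50)   # hypothetical prior centred on 0.5
print(mle(k, n), flat, informative)
```

With this much data even the informative prior only nudges the estimate; with a handful of observations the same prior would dominate.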
> Well, the fact is there are too many small-sample studies being produced for all or even most of them to be critically analysed by people with deep understanding.
And this I totally agree with. If there's one thing I dislike about academia, it's the tendency to fund low-powered studies that get nowhere. Better to go all in, with sufficient support from experienced people, in fewer and bigger studies.
Bayesian reasoning has even worse underpinnings. You don’t actually know any of the things the equations want. For example, suppose a robot is counting Red and Blue balls from a bin: the count is 400 Red and 637 Blue, and it has just classified a Red ball.
Now what’s the count? Wait, what’s the likelihood it misclassified a ball? How accurate are those estimates, and the estimates of those estimates ...
For a real-world example, someone using Bayesian reasoning when counting cards should consider the possibility that the deck doesn’t have the correct cards, and the possibility that the deck’s cards have been changed over the course of the game.
Huh? You can derive all of those from Bayesian models. If you're counting balls from a bin with replacement, and your bot has counted 400 Red and 637 Blue, you have a Beta-Binomial model. That means p_blue | data ~ Beta(638, 401), assuming a Uniform prior, i.e. Beta(1, 1): the blue count is added to the first parameter and the red count to the second. The probability that the next observed ball is red is P(red_obs | data) = 1 - P(blue_obs | data), which is calculable from p_blue | data; here it is 1 - 638/1039 = 401/1039, roughly 0.386. In fact in this simple example you can analytically derive all of these values, so you don't even need a simulation!
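The conjugate update for this example can be written out exactly. A minimal sketch, assuming blue is treated as the "success" so that a uniform Beta(1, 1) prior updates to Beta(1 + 637, 1 + 400); the variable names are just for illustration.

```python
from fractions import Fraction

red, blue = 400, 637          # counts from the example
a_prior, b_prior = 1, 1       # uniform prior on p_blue

# Conjugate update: blue counts go to the first parameter, red to the second.
a_post = a_prior + blue
b_post = b_prior + red

# Posterior predictive for the next draw is the posterior mean of p_blue.
p_next_blue = Fraction(a_post, a_post + b_post)
p_next_red = 1 - p_next_blue

# Posterior variance of p_blue for a Beta(a, b) distribution.
total = a_post + b_post
post_var = Fraction(a_post * b_post, total**2 * (total + 1))

print(a_post, b_post, p_next_blue, p_next_red, float(post_var) ** 0.5)
```

Everything here is closed form, which is the point being made: no simulation is needed for this model.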
Which rate? The rate you failed to mix the balls? The rate you failed to count a ball? The rate you misclassified the ball? The rate you repeatedly counted the same ball? The rate you started with an incorrect count? The rate you did the math wrong? etc
Here’s the experiment and here’s the data: that is concrete. It may be bogus, but it’s information. Updating probabilities based on recursive estimates of probabilities is largely restating your assumptions. Black swans can really throw a wrench into things.
Plenty of downvotes and comments, but nothing addressing the point of the argument might suggest something.
> Which rate? The rate you failed to mix the balls? The rate you failed to count a ball? The rate you misclassified the ball? The rate you repeatedly counted the same ball? The rate you started with an incorrect count? The rate you did the math wrong? etc
This is called modelling error. Both Bayesian and frequentist approaches suffer from modelling error. That's what TFA talks about when mentioning the normality assumptions behind the paper's GLM. Moreover, if errors are additive, certain distributions combine easily algebraically, meaning it's easy to "marginalize" over them as a single error term. In most GLMs, there's a normally distributed error term meant to marginalize over multiple independent, normally distributed error terms.
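The "single error term" claim rests on a standard fact: a sum of independent zero-mean normal errors is itself normal, with variance equal to the sum of the individual variances. A quick sketch that checks this by simulation; the three standard deviations are hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical standard deviations of three independent error sources.
sigmas = [0.5, 1.0, 2.0]

# Analytic result: variances add, so the combined sd is sqrt(sum of squares).
combined_sd = sum(s**2 for s in sigmas) ** 0.5

# Simulate the summed error and compare its spread with the analytic value.
samples = [sum(random.gauss(0, s) for s in sigmas) for _ in range(100_000)]
empirical_sd = statistics.pstdev(samples)

print(combined_sd, empirical_sd)
```

This only covers additive, independent, normal errors; a misclassification or a double count is not of that form and needs its own term in the model.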
> Plenty of downvotes and comments, but nothing addressing the point of the argument might suggest something.
I don't understand the point of your argument. Please clarify it.
> Here’s the experiment and here’s the data: that is concrete. It may be bogus, but it’s information. Updating probabilities based on recursive estimates of probabilities is largely restating your assumptions.
What does this mean, concretely? Run me through an example of the problem you're bringing up. Are you saying that posterior-predictive distributions are "bogus" because they're based on prior distributions? Why? They're just based on the application of Bayes Law.
> Black swans can really throw a wrench into things
A "black swan" as Taleb states is a tail event, and this sort of analysis is definitely performed (see: https://en.wikipedia.org/wiki/Extreme_value_theory). In the case of Bayesian stats, you're specifically calculating the entire posterior distribution of the data. Tail events are visible in the tails of the posterior predictive distribution (and thus calculable) and should be able to tell you what the consequences are for a misprediction.
You don’t find black swans from the data; you find them by building better models. You can look at 100 years of local flood and weather data to build up a flood assessment, but that’s not going to include mudslides or earthquakes etc. The same applies to studies.
My point is this: you can’t combine them using Bayesian statistics while adjusting for the possibility of research fraud; it’s simply not in the data.
They’re great for well understood domains, less so for research. Frequentist models don’t work, but they also don’t even try.
PS: Math errors don’t really fall into modeling error.
I wouldn't think of Black Swan events as tail events, so much as model failures or regime-changes. As in, 'we modeled this as a time-invariant gaussian distribution, but it's actually a mixture model where the second hidden mode was triggered in the aftermath of an asteroid strike that we didn't model for, because of course we didn't.'
As for the arguey person you were responding to: frequentist modeling is just as bad or worse for these sorts of situations.
Suppose the likelihood it misclassified a ball is significantly different from zero, but not yet known precisely.
If you use a model that doesn't ask you to think about this likelihood at all, you will get the same result as if you had used Bayes and consciously chosen to approximate the likelihood of misclassification as zero.
You may get slightly better results if you have a reasonable estimate of that probability, but you will get no worse if you just tell Bayes zero.
It feels like you're criticizing the model for asking hard questions.
I feel like explicitly not knowing an answer is always a small step ahead of not considering the question.
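The "tell Bayes zero" point can be illustrated with a simple plug-in correction. This is a sketch under a strong assumption that is not in the thread: the robot mislabels each ball with the same probability `error_rate` in both directions, so the observed blue fraction is q = p(1 - e) + (1 - p)e, which inverts to p = (q - e) / (1 - 2e). It is a point estimate, not a full posterior.

```python
def corrected_blue_fraction(observed_blue_fraction, error_rate):
    """Invert q = p * (1 - e) + (1 - p) * e for the true blue fraction p.

    Assumes a symmetric misclassification rate e below 0.5 (hypothetical).
    """
    if not 0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5)")
    return (observed_blue_fraction - error_rate) / (1 - 2 * error_rate)


observed = 637 / (400 + 637)  # observed blue fraction from the example

# Setting the error rate to zero reproduces the naive estimate exactly.
naive = corrected_blue_fraction(observed, 0.0)

# A hypothetical 5% error rate pushes the estimate away from 0.5.
adjusted = corrected_blue_fraction(observed, 0.05)

print(naive, adjusted)
```

Ignoring misclassification and modelling it with a rate of zero give the same number; the explicit model just makes the assumption visible and adjustable.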
The criticism is important because of how Bayes keeps carrying the probability over between experiments. Garbage in, garbage out.
As much as people complain about frequentist approaches, examining the experiment independently from the output of the experiment effectively limits contamination.