Hacker News
by jerf 1733 days ago
Speaking not to this study in particular necessarily, I strongly agree with the general point. Science has really been held back by an over-focusing on "significance". But I'm not really interested in a pile of hundreds of thousands of studies that establish a tiny effect with suspiciously-just-barely-significant results. I'm interested in studies that reveal robust results that are reliable enough to be built on to produce other results. Results of 3% variations with p=0.046 aren't. They're dead ends, because you can't put very many of those into the foundations of future papers before the probability of one of your foundations being incorrect is too large.
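
A rough sketch of that compounding, assuming (hypothetically) that each earlier result is independently wrong with some fixed probability `q`. Note that `q` is not the p-value itself, and the 0.05 used below is only a placeholder for illustration:

```python
def prob_any_foundation_wrong(q: float, n: int) -> float:
    """Chance that at least one of n independent results is wrong,
    if each is wrong with probability q (an assumed, illustrative number)."""
    return 1.0 - (1.0 - q) ** n


# How quickly the risk grows as more shaky results are stacked up.
for n in (1, 5, 10, 20):
    print(n, round(prob_any_foundation_wrong(0.05, n), 3))
```

Under those assumptions, ten such foundations already leave roughly a 40% chance that at least one of them is wrong.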

To the extent that those are hard to come by... Yeah! They are! Science is hard. Nobody promised this would be easy. Science shouldn't be something where labs are cranking out easy 3%/p=0.046 papers all the time just to keep funding. It's just a waste of money and time of our smartest people. It should be harder than it is now.

Too many proposals are obviously only going to be capable of turning up that result (insufficient statistical power is often obvious right in the proposal, if you take the time to work the math). I'd rather see more wood behind fewer arrows, and see fewer proposals chasing much more statistical power, than the chaff of garbage we get now.
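
"Working the math" can be as small as the sketch below. It assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size (Cohen's d); the function name and the example effect sizes are illustrative, not taken from any particular proposal:

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.5, 0.2, 0.1):
    print(d, n_per_group(d))
```

The required sample grows with the inverse square of the effect size, which is why a proposal with a small cohort chasing a small effect is underpowered on its face.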

If I were King of Science, or at least, editor of a prestigious journal, I'd want to put word out that I'm looking for papers with at least one of some sort of significant effect, or a p value of something like p = 0.0001. Yeah. That's a high bar. I know. That's the point.

"But jerf, isn't it still valuable to map out all the little things like that?" No, it really isn't. We already have every reason in the world to believe the world is drenched in 1%/p=0.05 effects. "Everything's correlated to everything", so that's not some sort of amazing find, it's the totally expected output of living in our reality. Really, this sort of stuff is still just below the noise floor. Plus, the idea that we can remove such small, noisy confounding factors is just silly. We need to look for the things that stand out from that noise floor, not spending billions of dollars doing the equivalent of listening to our spirit guides communicate to us over white noise from the radio.

17 comments

> If I were King of Science, or at least, editor of a prestigious journal, I'd want to put word out that I'm looking for papers with at least one of some sort of significant effect, or a p value of something like p = 0.0001. Yeah. That's a high bar. I know. That's the point.

And study preregistration to avoid p-hacking and incentivize publishing negative results. And full availability of data, aka "open science".

Preregistration, a requirement to publish negative or null results, and full data availability are, arguably, the three legs of modern science. If we collectively don't enforce this, nobody is doing science, they're just fucking around and writing it down.
I like rules like these. One context where preregistration, null results, and full data are all required is clinical trials overseen by the FDA. It’s no surprise that those studies carry a lot of weight.
Also replication studies for negative or null results in addition to positive ones (we don't have either).
You do realize there is a million negative results for every one positive result? This is equally easy to game, maybe easier.
Yes, and knowing what's been tried and what has failed is important.
I think what's being pointed out is that "researchers" could pump out hundreds of easy to test negatives every day if a negative result was just as incentivised.

I do agree though, negatives are just as important when the intent is to prove/disprove a meaningful hypothesis.

A negative result won't make a career. If negative results are only required to go into a repository, I don't think there's much danger of over-incentivising them. You can't mandate that Nature or Cell publish negative results.
we tried using 0.1 mL, it didn't work

we tried using 0.11 mL, it didn't work

we tried using 0.12 mL, it didn't work

we tried using 0.13 mL, it didn't work

    we tried using 0.10 mL, it didn't work
    we tried using 0.11 mL, it didn't work
    we tried using 0.13 mL, it didn't work
    we tried using 0.15 mL, it didn't work
    we tried using 0.17 mL, it didn't work
    we tried using 0.16 mL, it didn't work
    we tried using 0.18 mL, it didn't work
    we tried using 0.20 mL, it didn't work
    we tried using 0.14 mL, it didn't work
    we tried using 0.12 mL, it worked so we published
Do you want to know the ones that "didn't work" existed? Or are you happy with just the one that "worked" being written up in isolation?
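
To put a number on why the unpublished attempts matter, here is a small simulation sketch. It assumes the treatment does nothing at any dose, so each test's p-value is uniform on [0, 1), and that the ten tests are independent; the function name and run counts are made up for illustration:

```python
import random


def share_with_false_positive(n_doses: int, alpha: float, n_runs: int, seed: int = 0) -> float:
    """Fraction of simulated labs that see at least one 'significant' dose
    even though no dose has any real effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        # Under a true null, a well-calibrated p-value is uniform on [0, 1).
        p_values = [rng.random() for _ in range(n_doses)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_runs


naive = share_with_false_positive(10, 0.05, 20_000)
corrected = share_with_false_positive(10, 0.05 / 10, 20_000)  # Bonferroni threshold
print(naive, corrected)
```

With ten tries at the usual 0.05 threshold, roughly four labs in ten get something to publish from pure noise. A reader can only apply a correction like Bonferroni if they know the other nine attempts existed.
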
If someone really tested those hypotheses, let them publish. I doubt they'll get funding so it'll be on their own dime. In practice people do run experiments like that, but they only publish the 1 in 4 trials that is successful.
Look to physics for how negative results should be published. There typically has to be reason to suspect some dosage range should work, in which case that sequence of studies you describe would be perfectly valid if it's within that range.
Why would someone want to game a negative result? Nobody ever becomes famous for saying my approach doesn't work. (As long as science is open, to make sure there is actually good work done by researchers before reaching this neg result.)
To have their name on a publication, which is a currency in the academic world.
I've thought about the idea of allowing people to separately publish data and analysis. Right now, data are only published if the analysis shows something interesting.

Improving the quality of measurements and data could be a rewarding pursuit, and could encourage the development of better experimental technique. And a good data set, even if it doesn't lead to an immediate result, might be useful in the future when combined with data that looks at a problem from another angle.

Granted, this is a little bit self serving: I opted out of an academic career, partially because I had no good research ideas. But I love creating experiments and generating data! Fortunately I found a niche at a company that makes measurement equipment. I deal with the quality of data, and the problem of replication, all day every day.

It would be interesting to consider how much knowledge would never have been uncovered if you were King of Science. All those subtle, barely seen interactions in nature that on further investigation turned out to be something rather special.
Such as? It would also be interesting to explore how many dead ends we wouldn't have wasted time on, and so what other things might have been discovered sooner.
Scientists aren't stupid. No one saw a paper where a predictor explained 1% of the variance in an outcome and based solely on a significant p value decided that was a great road to base an entire career on. The problem, as described by the parent comment, doesn't really exist in funding structures and the scientific literature. It does occur to some degree in media coverage of science.

One could make the case that in GWAS studies it has occurred, but not because small effect sizes are inconsequential, the statistical methods just weren't able to separate grain from chaff for a while.

An allele that is responsible for 2% of the variation in disease risk might seem inconsequential, but 25 of those together can serve as a polygenic risk score that can predict disease and target treatment.
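
As back-of-the-envelope arithmetic for that, assuming the variants are independent and act additively (real variants can be linked or interact, so treat this as an upper-bound sketch with a hypothetical helper name):

```python
def combined_r2(per_variant_r2: float, n_variants: int) -> float:
    """Variance explained by the sum of n independent, additive predictors.

    Independence and additivity are assumptions of this sketch.
    """
    return min(1.0, per_variant_r2 * n_variants)


r2 = combined_r2(0.02, 25)
print(r2, r2 ** 0.5)  # share of variance explained, and the implied correlation
```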

> Scientists aren't stupid. No one saw a paper where a predictor explained 1% of the variance in an outcome and based solely on a significant p value decided that was a great road to base an entire career on. The problem, as described by the parent comment, doesn't really exist in funding structures and the scientific literature.

Of course they're stupid. Everyone is stupid. That's why we have a "scientific method" and a formal discipline of logic to overcome fallacious reasoning and cognitive biases. If people weren't stupid we wouldn't need any of these disciplines to check our mistakes.

And yes, what you describe does happen all of the time. We literally just had a thread on HN about the failure of the amyloid hypothesis in Alzheimer's and the decades of work wasted on it. Many researchers are still trying to push it as a legitimate therapeutic target despite every clinical trial to date failing spectacularly. As Planck said, science advances one funeral at a time.

Which isn't to say that small effect sizes aren't legitimate research targets either, but if you're after a small effect size, the rigour should be scaled proportionally.

So your example of decades being wasted chasing an initial tiny effect size, all the time, was... An example of a failed mechanistic hypothesis that wasn't based on a tiny effect size.
This paper was pretty clearly pre-specified here; https://files.givewell.org/files/DWDA%202009/IPA/Masks_RCT_P...
And it was actually preregistered as well: https://osf.io/vzdh6/
The problem is that when you’re on the cusp of a new thing, unless you’re super lucky, the result will necessarily be near the noise floor. Real science is like that.

But I definitely agree it’d be nice to go back and show something is true to p=.0001 or whatever. Overwhelmingly solid evidence is truly a wonderful thing, and as you say, it’s really the only way to build a solid foundation.

When you engineer stuff, it needs to work 99.99-99.999% of the time or more. Otherwise you’re severely limited to how far your machine can go (in terms of complexity, levels of abstraction and organization) before it spends most of its time in a broken state.

I’ve been thinking about this while playing Factorio: so much of our discussion and mental modeling of automation works under the assumption of perfect reliability. If you had SLIGHTLY below 100% reliability in Factorio, the game would be a terrible grind limited to small factories. Likewise with mathematical proofs or computer transistors or self driving cars or any other kind of automation. The reliability needs to be insanely good. You need to add a bunch of nines to whatever you’re making.
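
A quick sketch of why the nines matter, assuming a chain of steps that each succeed independently with the same probability and where any single failure breaks the whole chain (the function name and the 50% target are arbitrary choices for illustration):

```python
from math import floor, log


def max_steps(per_step: float, target: float = 0.5) -> int:
    """Longest chain of independent steps that still works end to end
    with probability at least `target`."""
    return floor(log(target) / log(per_step))


for r in (0.99, 0.999, 0.9999, 0.99999):
    print(r, max_steps(r))
```

Each extra nine buys about ten times more depth before the whole thing is more likely broken than working.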

A counterpoint to this is when you’re in an emergency and inaction means people die. In that case, you need to accept some uncertainty early on.

> If you had SLIGHTLY below 100% reliability in Factorio, the game would be a terrible grind limited to small factories.

I'd argue you do have <100% reliability in Factorio, and much of the game is in increasing the 9s.

Biters can wreak havoc on your base. Miners contaminate your belts with the wrong types of ore, if you weren't paying enough attention near overlapping fields. Misplaced inserters may mis-feed your assemblers, reducing efficiency or leaving outright nonfunctional buildings. Misclicks can cripple large swaths of your previously working factory, ruining plenty of speedruns if they go uncaught. For later game megabase situations, you must deal with limited lifetimes as mining locations dry up, requiring you to overhaul existing systems with new routes of resources into them. As inputs are split and redirected, existing manufacturing can choke and sputter when they end up starved of resources. Letting your power plants starve of fuel can result in a small crisis! Electric miners mining coal, refineries turning oil into solid fuel, electric inserters fueling the boilers, water pumps providing the water to said boilers - these things all take power, and jump starting these after a power outage takes time you might not have if you are under active attack and your laser turrets are all offline as well.

But you have means of remediating much of this unreliability. Emergency fuel and water stockpiles, configuring priorities such that fuel for power is prioritized ahead of your fancy new iron smelting setup, programmable alerts for when input stockpiles run low, ammo-turrets that work without power, burner inserters for your power production's critical path will bootstrap themselves after an outage, roboports that replace biter-attacked defenses.

Your first smelting setup in Factorio will likely be a hand-fed burner miner and furnace, taking at most 50 coal. This will run out of power in minutes. Then you might use inserters to add a coal buffer. Then a belt of coal, so you don't need to constantly refill the coal buffer. Then a rail station, so you don't need to constantly hand-route entirely new coal and ore mining patches. Then you'll use blueprints and bots to automate much of constructing your new inputs. If you're really crazy, you'll experiment with automating the usage of those blueprints to build self-expanding bases...

I really considered getting into Factorio but your comment is exactly why I can’t touch it. I have certain demands upon my time that would inevitably go unmet as I fuss with factory.
Holy shit I was about to compose exactly this answer! Parent was marketing straight to my lizard brain
Yes be warned. Factorio is the most addictive game I have ever played.
Now imagine if machines got clogged 1% of the time and you had to fix them, or if items occasionally fell off conveyor belts onto other conveyor belts. The amount of redundancy and work that would create would be paralyzing, but that’s the bare minimum of recreating what goes wrong in the real world. I love Factorio, but what always strikes me as most interesting is thinking about what it is you get to take for granted in one of the most complex games around.
That's a nice post and all, but none of that had anything to do with reliability. In all of those cases, those components worked exactly as designed when operating within their specification ranges (ie inserters insert when they have power).

The point is, it would be significantly more complex if things frequently failed even when "operating properly". And this happened at all levels of abstraction in a factory.

You're drawing what appear to be arbitrary distinctions between failure modes without making a good argument as to why one is a reliability issue and another is not.

My printer might jam if I feed paper crooked or poorly. My assemblers might jam if I feed incorrect components through misclicks, misplaced miners, or filled outputs.

My printer might fail from the entropy of wear and tear. My assemblers might fail from the entropy of biters attracted by generated pollution.

My printer might stall from running out of paper or a filled output tray. My assemblers might stall from running out of inputs or a filled output belt or chest.

Why is the printer arguably unreliable, but the assembler "100% reliable"?

Failures of my printer are not caused by magic fairies sprinkling dice-rolling pixie dust on my toner cartridge. Failures have physical causes. That Factorio's assembler failures have modeled causes as well, instead of an arbitrary and magic dice roll, does not detract from those failure modes being reliability issues.

That my printer fails far less frequently than my Factorio assemblers points to my printer being more reliable than my Factorio assemblers. Your point that reliability could be even worse misses my point, which is merely that not only does Factorio already avoid the fiction of "100%" or "perfect reliability" - but that perhaps Factorio already models reliability worse than "real-life" in some aspects already.

It's still reliability, just of the whole system rather than the individual parts. The aliens breaking stuff is part of the whole system "operating properly".

I don't think it would be particularly bad for inserters inserting at slightly different speeds from each other, or occasionally destroying the item it was supposed to insert. Same with components occasionally breaking on their own.

Fine. Do it like the experimental physicists do: if you think you're on to something, refine and repeat the experiment in order to get a more robust, repeatable result.

The original sin of the medical and social sciences is failing to recognize a distinction between exploratory research and confirmatory research and behave accordingly.

The problem is that it’s really hard to get good data, ethically, in medical sciences. Something that improves outcomes by 5-10% can be really important, but trying to get a study big enough to prove it can be super expensive already.
Nobody likes being in the control group of the first working anti-aging serum...
> Nobody likes being in the control group of the first working anti-aging serum...

You only know whether it works when the study has been completed. You also only know whether the drug has (potentially) disastrous consequences when the study has been completed. Thus, I am not completely sure whether your claim holds.

You missed the "working" part. Success was a prerequisite to their after-the-fact feelings. At least some of the control group will be in old age but still alive when we know it works. They might not know if it is infinite life (and side effects may turn it into "die at 85", so some of the control group may outlive the intervention group after the study), but they will know that on average they did worse.
People opt into the study in the first place. I'm willing to bet that no one opts into the study hoping to be in the control group.
> I’ve been thinking about this while playing Factorio: so much of our discussion and mental modeling of automation works under the assumption of perfect reliability. If you had SLIGHTLY below 100% reliability in Factorio, the game would be a terrible grind limited to small factories.

So I'm making a guess here that you play with few monsters or non-aggressive monsters?

> So I'm making a guess here that you play with few monsters or non-aggressive monsters?

Aggressively building turret walls, defensive train lines, and so on very quickly pays dividends here. Particularly if you claim as much territory as you can each time you expand instead of simply defending what you've built out.

If done this way building/improving defenses and managing enemies becomes a task you maintain every so often and doesn't spill over into the reliability of your base.

Currently playing a game to minimize pollution to try to totally avoid biter attention. Surrounded by trees, now almost entirely solar with efficiency modules.
> when you’re on the cusp of a new thing, unless you’re super lucky, the result will necessarily be near the noise floor. Real science is like that.

That's not necessarily true in social sciences. When you're working with large survey datasets, many variables are significantly related. That doesn't mean these relationships are meaningful or causal, they could be due to underlying common causes, etc. (Maybe social sciences weren't included in "real science" - but there's where a lot of stats discussions focus)

Come into Bayesian land, the water is fine. The whole NHST edifice starts to seem really shaky once you stop and wonder if "True" and "False" are really the only two possible states of a scientific hypothesis. Andrew Gelman has written about this in many places, e.g. http://www.stat.columbia.edu/~gelman/research/published/aban....
> The whole NHST edifice starts to seem really shaky once you stop and wonder if "True" and "False" are really the only two possible states of a scientific hypothesis.

The root problem here is that people tend to dichotomise what are fundamentally continuous hypothesis spaces. The correct question is not "is drug A better than drug B?", it's "how much better or worse is drug A compared to drug B?". And this is an error you can do both in Bayesian and frequentist lands, though culturally the Bayesians have a tendency to work directly with the underlying, continuous hypothesis space.

That said, there are sometimes external reasons why you have to dichotomise your hypothesis space. E.g. ethical reasons in medicine, since otherwise you can easily end up concluding that you should give half your patients drug A and the other half drug B, to minimise volatility of outcomes (this situation would occur when you're very uncertain which drug is better).

Gelman et al's BDA3 has a fun exercise estimating heart-disease rates in one of the early chapters that demonstrates this issue with effect-sizes. BDA3 uses a simple frequentist model to determine heart-disease rates and shows that areas with small population sizes have heavily exaggerated heart-disease rates because of the small base population. Building a Bayesian model does not have the same issue as the prior population prevalence incorporates the small base population sizes.
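
A toy version of that shrinkage effect, using a Beta prior on the rate and a binomial likelihood. The prior (about 1 case per 1,000 people) and the two areas below are invented numbers for illustration, not the data from the book:

```python
def raw_rate(cases: int, population: int) -> float:
    """Plain observed rate, with no pooling across areas."""
    return cases / population


def shrunk_rate(cases: int, population: int, prior_a: float, prior_b: float) -> float:
    """Posterior mean of the rate under a Beta(prior_a, prior_b) prior."""
    return (cases + prior_a) / (population + prior_a + prior_b)


# Hypothetical prior: mean 2 / (2 + 1998) = 0.001, worth about 2,000 people of data.
PRIOR_A, PRIOR_B = 2, 1998

for name, cases, pop in [("small area", 2, 500), ("large area", 1000, 1_000_000)]:
    print(name, raw_rate(cases, pop), shrunk_rate(cases, pop, PRIOR_A, PRIOR_B))
```

The small area's raw rate is four times the prior mean on the strength of two cases, and the posterior pulls it most of the way back. The large area barely moves because its own data swamps the prior.
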
It's interesting that high p-values actually seem to more conclusively state something than low p values (like p < 0.05) do.

With a high p value, you can say with some degree of certainty that your test was unable to detect any effect. Whether it was due to the lack of an effect or because your test wasn't capable of measuring it

With a low p value, you don't actually really know if you detected something interesting. It could be due to a flawed test, biases, non-causal correlations, faulty p-hacky stats, etc.

So why do we consider the latter more worthwhile when it seems to say less?

Bayesianism makes the problem much worse. Prior-hacking is easier and harder to detect than p-hacking, and Bayesianism has no way to exclude noise results at all. I'm constantly baffled when people suggest it as a solution to these problems.
> Prior-hacking is easier and harder to detect than p-hacking

But that's comparing apples to oranges. Setting a reasonable prior is akin to frequentists interpreting the effect size (including its confidence interval) in light of deep domain knowledge. To produce a good analysis using either Bayesian or frequentist methodology (or to criticise such an analysis), you have to have deep domain knowledge. There's no getting around that, and arguably the use of p-values often lets you get away with shoddy domain knowledge.

> and Bayesianism has no way to exclude noise results at all.

This statement doesn't make any sense. Bayesian methodology has plenty of mechanisms for working with and controlling noisy data (obviously, since it's one of the two key paradigms in statistics, which as a field fundamentally deals with noisy data). The precise error rates and uncertainties that are calculated are usually different from what you would use in a frequentist analysis, but most people consider this a benefit of Bayesian analysis.

> To produce a good analysis using either Bayesian or frequentist methodology (or to criticise such an analysis), you have to have deep domain knowledge. There's no getting around that, and arguably the use of p-values often lets you get away with shoddy domain knowledge.

The whole problem we're facing is that it requires too much domain knowledge and detailed analysis to dismiss results that are actually just noise. The whole point of p-values is that they give you a way to do that without needing that complex analysis with deep domain knowledge - they're not a replacement for doing in-depth analysis, they're a way to cull the worst of the chaff before you do, the statistical-analysis equivalent of FizzBuzz. Bayesianism has no substitute for that (you can't say anything until you've defined your prior, which requires deep domain knowledge), and as such makes the problem much worse.

> (you can't say anything until you've defined your prior, which requires deep domain knowledge)

Well, you can use a non-informative prior. And that's the correct choice when you genuinely don't have a better option. But you should always be able to justify that, and that in turn requires deep domain knowledge....which leads me to....

> The whole problem we're facing is that it requires too much domain knowledge and detailed analysis to dismiss results that are actually just noise.

....this is in no way a "problem" that needs fixing, by allowing shortcuts that can easily be hacked. Rather, it's a factual statement about the difficulty of drawing correct conclusions, in low Signal-to-Noise-Ratio domains. Whether you use p-values or not, and whether you use Bayesian methodology or not, you cannot get around the need to understand the data you're working with. Bad p-values are worse than none, since you have no knowledge of what error rate they actually achieve in the long-run.

> Bayesianism has no substitute for that

Yes it does. It's called Bayes factors. But as I said above, I completely disagree with your view of what a p-value is for.

> Well, you can use a non-informative prior. And that's the correct choice when you genuinely don't have a better option.

At which point you've just found a more cumbersome way to do frequentist statistics. Frequentist tools aren't inconsistent with Bayes' law (they can't be, since both are valid theorems) - indeed one could say that the whole project of frequentist statistics consists of building a well-understood suite of pre-baked priors and computations that are appropriate to situations that are commonly encountered.

> ....this is in no way a "problem" that needs fixing, by allowing shortcuts that can easily be hacked. Rather, it's a factual statement about the difficulty of drawing correct conclusions, in low Signal-to-Noise-Ratio domains. Whether you use p-values or not, and whether you use Bayesian methodology or not, you cannot get around the need to understand the data you're working with.

Well, the fact is there are too many small-sample studies being produced for all or even most of them to be critically analysed by people with deep understanding. And maybe the right fix for the problem is to give the right incentives for that kind of critical analysis (e.g. by allowing that kind of analysis to count as research for the purposes of journal publications and PhD theses just as much as "the original study" does, given that a study without that kind of critical analysis cannot truly be said to represent advancing human knowledge). But if you just tell people to do Bayesian analysis instead of frequentist analysis then that's not going to magically create deep understanding - rather people will try to replace shallow frequentist analysis with shallow Bayesian analysis, and shallow Bayesian analysis is a lot less effective and more hackable.

> Yes it does. It's called Bayes factors.

But you still need a prior to compute a Bayes factor.
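
To make that dependence concrete, here is a sketch of a Bayes factor for a coin-style experiment: H0 says p = 0.5 exactly, H1 puts a Beta(a, b) prior on p. The data (60 successes in 100 trials) and both priors are made up for illustration:

```python
from math import exp, lgamma, log


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function, via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_factor_01(k: int, n: int, a: float, b: float) -> float:
    """Bayes factor for H0: p = 0.5 against H1: p ~ Beta(a, b),
    given k successes in n trials. The binomial coefficient cancels."""
    log_m0 = n * log(0.5)
    log_m1 = log_beta(k + a, n - k + b) - log_beta(a, b)
    return exp(log_m0 - log_m1)


print(bayes_factor_01(60, 100, 1, 1))    # flat prior under H1
print(bayes_factor_01(60, 100, 10, 10))  # prior concentrated near 0.5
```

Same data, different prior under H1, different Bayes factor. Whether that counts as a flaw or as the method being honest about its assumptions is the disagreement in this subthread.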

Bayesian reasoning has even worse underpinnings. You don’t actually know any of the things the equations want. For example suppose a robot is counting Red and Blue balls from a bin, the count is 400Red and 637Blue, it just classified a Red ball.

Now what’s the count, wait what’s the likelihood it misclassified a ball? How accurate are those estimates, and those estimates of those ...

For a real world example someone using Bayesian reasoning when counting cards should consider the possibility that the deck doesn’t have the correct cards. And the possibility that the decks cards have been changed over the course of the game.

Huh? You can derive all of those from Bayesian models. If you're counting balls from a bin with replacement, and your bot has counted 400Red with 637Blue, you have a Beta/Binomial model. That means p_blue | data ~ Beta(638, 401) assuming a Uniform prior. The probability of observing a red ball given the above p_blue | data is P(red_obs | p_blue) = 1 - P(blue_obs | p_blue), which is calculable from p_blue | data. In fact in this simple example you can even analytically derive all of these values, so you don't even need a simulation!
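
Spelled out as a sketch, with p_blue's posterior taken as Beta(blue + 1, red + 1) under a uniform Beta(1, 1) prior, and assuming perfect classification and draws with replacement (variable names are just for this example):

```python
red, blue = 400, 637

# Uniform Beta(1, 1) prior on p_blue plus binomial counts
# gives a Beta(blue + 1, red + 1) posterior.
a, b = blue + 1, red + 1

post_mean_blue = a / (a + b)                     # posterior mean of p_blue
post_var_blue = a * b / ((a + b) ** 2 * (a + b + 1))
p_next_red = 1 - post_mean_blue                  # posterior predictive for a red draw

print(post_mean_blue, post_var_blue ** 0.5, p_next_red)
```
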
And if misclassification is a concern (as the parent mentioned) you can put a prior on that rate too!
Which rate? The rate you failed to mix the balls? The rate you failed to count a ball? The rate you misclassified the ball? The rate you repeatedly counted the same ball? The rate you started with an incorrect count? The rate you did the math wrong? etc

Here’s the experiment and here’s the data is concrete it may be bogus but it’s information. Updating probabilistic based on recursive estimates of probabilities is largely restating your assumptions. Black swans can really throw a wrench into things.

Plenty of downvotes and comments, but nothing addressing the point of the argument might suggest something.

> Which rate? The rate you failed to mix the balls? The rate you failed to count a ball? The rate you misclassified the ball? The rate you repeatedly counted the same ball? The rate you started with an incorrect count? The rate you did the math wrong? etc

This is called modelling error. Both Bayesian and frequentist approaches suffer from modelling error. That's what TFA talks about when mentioning the normality assumptions behind the paper's GLM. Moreover, if errors are additive, certain distributions combine together easily algebraically meaning it's easy to "marginalize" over them as a single error term. In most GLMs, there's a normally distributed error term meant to marginalize over multiple i.i.d normally distributed error terms.

> Plenty of downvotes and comments, but nothing addressing the point of the argument might suggest something.

I don't understand the point of your argument. Please clarify it.

> Here’s the experiment and here’s the data is concrete it may be bogus but it’s information. Updating probabilistic based on recursive estimates of probabilities is largely restating your assumptions.

What does this mean, concretely? Run me through an example of the problem you're bringing up. Are you saying that posterior-predictive distributions are "bogus" because they're based on prior distributions? Why? They're just based on the application of Bayes Law.

> Black swans can really throw a wrench into things

A "black swan" as Taleb states is a tail event, and this sort of analysis is definitely performed (see: https://en.wikipedia.org/wiki/Extreme_value_theory). In the case of Bayesian stats, you're specifically calculating the entire posterior distribution of the data. Tail events are visible in the tails of the posterior predictive distribution (and thus calculable) and should be able to tell you what the consequences are for a misprediction.

Suppose the likelihood it misclassified a ball is significantly different from zero, but not yet known precisely.

If you use a model that doesn't ask you to think about this likelihood at all, you will get the same result as if you had used bayes and consciously chose to approximate the likelihood of misclassification as zero.

You may get slightly better results if you have a reasonable estimate of that probability, but you will get no worse if you just tell Bayes zero.

It feels like you're criticizing the model for asking hard questions.

I feel like explicitly not knowing an answer is always a small step ahead of not considering the question.

The criticism is important because of how Bayes keeps using the probability between experiments. Garbage in Garbage out.

As much as people complain about frequentist approaches, examining the experiment independently from the output of the experiment effectively limits contamination.

Can't you just add that to your equation? Seems like for anything real, this will not go many levels deep at all before it's irrelevant.
Don't get distracted by the click bait title. Effect size should be captured by statistical significance (larger effects are less likely to happen by chance). Author is really complaining that the original study didn't report enough data to check their analysis or do alternative analysis methods. Better title for article would be "Hard to peer review when you don't share the data"
Note the point in the essay that statistical significance is meaningless if the model does not correspond to reality — which, in this case as in many, they very much do not.
A few years ago, HN comments complained about the censorship that only leaves successful studies. We need to report on everything we've tried, so we don't walk around on donuts.

What's missing in my mind is admitting that results were negative. I'm reading up on financial literacy, and many studies end with some metrics being "great" at p 5%, but then some other metrics are also "great" at p 10%, without the author ever explaining what they would have classified as bad. They're just reported without explanation of what significance they would expect (in their field).

> ...so we don't walk around on donuts

I agree with what you're saying, but I don't understand this phrase.

I don't know where that turn of phrase comes from, but I imagine it's synonymous with 'walking around in circles'.
The phrase "walk around on donuts" has one Google result and it's this thread.
You know how sometimes you'll accidentally step on a donut and you'll have to call your dog over to lick all the jelly off your toes? That.
Not only is it not valuable to publish tons of studies with p=.04999 and small effect size, in fact it's harmful. With so many questionable results published in supposedly reputable places it becomes possible to "prove" all sorts of crackpot theories by selectively citing real research. And if you try to dispute the studies you can get accused of being anti-science.
Only a problem for people who are trying hard not to think. You can just ignore those people. They're not doing any harm believing their beliefs.
The USDA food pyramid and nutrition education would suggest that there's an inherent danger in just letting people believe irrational things after a correction is known. It depends on the belief - flat earth people aren't likely to cause any harm. Bad nutrition information can wreak havoc at scale.
Flat earth beliefs don't cause harm, but flat earth believers have largely upgraded to believing more dangerous nonsense.
Data or it didn't happen. This really sounds like you're inventing a caricature of your enemy and assigning them "dangerous" qualities so you can hate them more.
Nobody needs to caricature the insane beliefs surrounding COVID (or flat earth), people holding them are doing a good enough job of that themselves.

I do have a few favorites. "COVID tests give you COVID, so I won't go get tested" is certainly up there. I can't say I give two figs about your opinion on the Earth's topology, but this one is a public health problem, that's crippling hospitals around the country.

We are literally in the middle of a global crisis that is founded on people misunderstanding science.
What on earth are you talking about? I guess climate change but that's certainly not founded on people misunderstanding science, it's caused by people understanding science which led to industrialization. Or maybe you mean covid-19? Not that either. You're just trying to make it seem like it's somehow very serious and bad if everyone doesn't agree with you. It's not.
I’ll presume you’re referring to everyone involved in the gain of function research that led to the virus.
I blame most of this on pop science. It's absolutely ruined the average public's respect for the behind the scenes work doing interesting stuff in every field. What's worse is the attitude it breeds. Anti-intellectualism runs rampant amongst even well educated members of my social circle. It's frustrating to say the least.
Some say that it is not anti-intellectualism to realize the emperor has no clothes but enlightenment.

Either way it’s dangerous.

It can be both, but you're absolutely right.
"Believe the science" vs. "understand the process". The former merely uses the language of science to gain legitimacy.
> Plus, the idea that we can remove such small, noisy confounding factors is just silly. We need to look for the things that stand out from that noise floor

We have found most of them, and all the easy ones. Today the interesting things are near the noise floor. 3000 years ago atoms were well below the noise floor, now we know a lot about them - most of it seems useless in daily life yet a large part of the things we use daily depend on our knowledge of the atom.

Science needs to keep separating things from the noise floor. Some of them become important once we understand it.

I don't think we have found most of them. I think we make it look like we've found most of them because we keep throwing money at these crap studies.

Bear in mind that my criteria are two-dimensional, and I'll accept either. By all means, go back and establish your 3% effect to a p-value of 0.0001. Or 0.000000001. That makes that 3% much more interesting and useful.

It'll especially be interesting and valuable when you fail to do so.

But we do not, generally, do that. We just keep piling up small effects with small p-values and thinking we're getting somewhere.

Further, if there is a branch of some "science" that we've exhausted so thoroughly that we can't find anything that isn't a 3%/p=0.047 effect anymore... pack it in, we're done here. Move on.

However, part of the reason I so blithely say that is that I suspect if we did in fact raise the standards as I propose here, it would realign incentives such that more sciences would start finding more useful results. I suspect, for instance, that a great deal of the soft sciences probably could find some much more significant results if they studied larger groups of people. Or spent more time creating theories that aren't about whether priming people with some sensitive word makes them 3% more racist for the next twelve minutes, or some other thing that even if true really isn't that interesting or useful as a building block for future work.

So 3% is not interesting but the difference between 10^-7 and 10^-8 probability that there is no effect is interesting somehow?
Meta analysis after enough small studies show the effect exists.
Individual atoms, or small numbers of them, may be beneath some noise floor, but not combined atoms.

A salt crystal (Lattice of NaCl atoms) is nothing like a pure gold nugget (clump of Au atoms).

That difference is a massive effect.

So to begin with, we have this sort of massive effect which requires an explanation, which is where atoms then come in.

Maybe the right language here is not that we need an effect rather than statistical significance, but that we need a clear, unmistakable phenomenon. There has to be a phenomenon, which is then explained by research. Research cannot be inventing the phenomenon by whiffing at the faint fumes of statistical significance.

> We have found most of them, and all the easy ones. Today the interesting things are near the noise floor.

The noise floor is not static. A major theoretical advance spurs an advance in instrumentation, which then supports more science. The hypothesis space is usually much larger than the data space, making the bottleneck theory, not data. The "end of progress" has been lamented again and again since before Galileo, only to be upended by a paradigm shifting theory that paved the way for lots of new science. Many of these theories were developed long after the data and instruments were available, and were produced with relatively simple data: Young's double slit experiment, Mendelian genetics, the photoelectric effect, Brownian motion, most of classical mechanics, quantum teleportation, BOLD MRI, etc.

Doesn't it make a difference if it's near the noise floor because it's hard to measure (atoms) or if it's near the noise floor because it's hardly there (masks)? Maybe if these "hardly there" results led to further research that isolated some underlying "very there" phenomena, they would be important, but until that happens, who cares if thinking about money makes you slightly less generous than thinking about flowers? If they're not building on previous research to discover more and more important things, then it doesn't seem like useful progress.
> or a p value of something like p = 0.0001

This has been proposed [0], albeit for a threshold of p < 0.005.

Here's Andy Gelman and others arguing otherwise [1]. They also got like 800 scientists to sign on to the general idea of no longer using statistical significance at all [2].

[0] https://www.nature.com/articles/s41562-017-0189-z

[1] http://www.stat.columbia.edu/~gelman/research/unpublished/ab...

[2] https://www.nature.com/articles/d41586-019-00857-9

Given the (estimated) number of scientists in the world and their general propensity to sign on to something… is 800 scientists a significant amount?
I don't know, and it's a fair point; I think I should have just summarised the follow-up article as "here's some follow-up by the authors for context"
This is clearly a cost/benefit tradeoff, and the sweet spot will depend entirely on the field. If you are studying the behavior of heads of state, getting an additional N is extremely costly, and having a p=0.05 study is maybe more valuable than having no published study at all, because the stakes are very high and even a 1% chance of (for example) preventing nuclear war is worth a lot. On the other hand, if you are studying fruit flies, an additional N may be much cheaper, and the benefit of yet another low effect size study may be small, so I could see a good argument being made for more stringent standards. In fact I know that in particle physics the bar for discovery is much higher than p=0.05.
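
To put a number on the particle physics comparison: the convention usually cited there is "5 sigma" (that figure is my addition, it is not stated above). A minimal sketch, assuming a one-sided tail of a standard normal, of how sigma levels translate to p-values:

```python
import math

def one_sided_p(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

# 1.645 sigma is roughly the familiar one-sided p = 0.05 threshold;
# 5 sigma is the commonly cited discovery convention (assumption, see lead-in).
for n_sigma in (1.645, 3.0, 5.0):
    print(f"{n_sigma} sigma -> one-sided p = {one_sided_p(n_sigma):.3g}")
```

The gap between the first and last line is about five orders of magnitude, which is the sense in which the bar is "much higher".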
What if it's the other way round and a p<0.05 study says that the best way to make sure a rival country does not do a nuclear strike on you first is to do a massive nuclear strike on them first?
Nothing is wrong with publishing small effect size results. Setting a lower p threshold or a higher bar for effect sizes for journal acceptance will just increase the positivity bias and also encourage more dodgy practices. Null results are important.

Recognizing that effect size is as important as significance could take the form of requiring effect size or variance explained to be reported every time the result of a statistical test is presented, rather than simply "a significant increase was observed (p = 0.01)", and of making that kind of reporting the standard in scientific journalism.
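
As a rough illustration of what such reporting might look like, here is a sketch that returns the effect size and variance explained next to the p-value for a two-group comparison. The function name and the choice of Cohen's d are my assumptions; the p-value uses a normal approximation, which is only reasonable for large samples, and the r^2 conversion assumes equal group sizes.

```python
import math
import statistics

def summarize_two_groups(a, b):
    """Report mean difference, Cohen's d, variance explained, and an approximate p."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.fmean(a), statistics.fmean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)

    # Cohen's d: mean difference in units of the pooled standard deviation.
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd

    # Variance explained, via r = d / sqrt(d^2 + 4) (equal group sizes assumed).
    r_squared = d * d / (d * d + 4)

    # Two-sided p from a z statistic (normal approximation, large samples only).
    z = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
    p = math.erfc(abs(z) / math.sqrt(2))

    return {"mean_diff": m1 - m2, "cohens_d": d, "r_squared": r_squared, "p_approx": p}

# Toy numbers purely to show the output shape, not real data.
print(summarize_two_groups([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Printing all four numbers together makes it harder to present "significant" without also saying "how big".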

If you were the king of science, I'd kindly ask you to think about replacing grant financing and all other financial incentives that go along with publishing. Now that would be efficient. 'Cause I currently make .05-barely-significant-results but if you force me to up my game I will provide .0001-barely-significant-results no problem, even with 'preregistration' or whatever hoop you hold in front of me.

As an aside, could you also please make medicine a real science, so I can finally scientifically demonstrate that my boss is wrong?

What do you (or anyone else) think about the statistical conclusions in this paper? Particularly the adjusted r-squared values reported.

https://www.cambridge.org/core/journals/american-political-s...

The current science economy around publishing is partially responsible, although it should also be said that finding no correlation is still a gain of knowledge that is valuable to build upon for people in the same field, even if it might not generate the most exciting read for others.
I agree we shouldn't listen to noise, but small effect size is not necessarily noise. (I agree it is highly correlated.) I mean, QED's effect size on g factor is 1.001. QED was very much worth finding out.
Maybe all studies should be preregistered, including their methods... like this one was?

https://osf.io/vzdh6/

p = 0.0001 doesn't help much. You can get to an arbitrarily small p by just using more data. The problem is trying to reject a zero width null hypothesis. Scientists should always reject something bigger than infinitesimally small so that they are not catching tiny systematic biases in their experiments. There are always small biases.
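
A small sketch of that point, under simplifying assumptions of mine: a z-test with known standard deviation, and an observed mean that sits exactly at a tiny true bias of 0.01 (real data would scatter around it). The `delta` margin is a hypothetical "smallest effect we care about". Against a zero-width null the p-value collapses as n grows; against a null of finite width it does not.

```python
import math

def p_zero_null(effect: float, sigma: float, n: int) -> float:
    """Two-sided p-value for H0: true effect == 0."""
    z = effect * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

def p_interval_null(effect: float, sigma: float, n: int, delta: float) -> float:
    """p-value for H0: |true effect| <= delta, evaluated at the boundary delta."""
    root_n = math.sqrt(n) / sigma
    upper = 0.5 * math.erfc((abs(effect) - delta) * root_n / math.sqrt(2))
    lower = 0.5 * math.erfc((abs(effect) + delta) * root_n / math.sqrt(2))
    return upper + lower

bias, sigma, delta = 0.01, 1.0, 0.05  # all illustrative values
for n in (100, 10_000, 1_000_000):
    print(n, p_zero_null(bias, sigma, n), p_interval_null(bias, sigma, n, delta))
```

With these numbers the zero-width null is eventually rejected at any threshold you like, while the interval null never is, because the bias is smaller than the margin.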

Gwern's page "Everything Is Correlated" is worth reading: https://www.gwern.net/Everything

It would at least filter out the social science experiments where results on 30 college students are "significant" at p=.04 (and it's too expensive to recruit 3000 of them to force significance).
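
For a sense of scale, the standard normal-approximation formula for a two-group comparison gives the sample size needed per group for a given standardized effect size d. This is only a sketch: the function name, the 80% power default, and the example effect size are my assumptions, and an exact t-based calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-sample comparison."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # critical value for the two-sided threshold
    z_power = z(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Hypothetical small effect (d = 0.2) at the usual and at a much stricter threshold.
print(n_per_group(0.2, alpha=0.05))
print(n_per_group(0.2, alpha=0.0001))
```

Running the numbers before recruiting shows quickly whether a 30-person sample has any realistic chance of detecting the effect being claimed.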