Hacker News new | ask | show | jobs
by eeeeeeehio 530 days ago
Peer review is not designed for science. Many papers are not rejected because of an issue with the science -- in fact, reviewers seldom have the time to actually check the science! As a CS-centric example: you'll almost never find a reviewer who reads a single line of code (if code is submitted with the paper at all). There is artifact review, but this is never tied to the acceptance of the paper. Reviewers focus on ideas, presentation, and the presented results. (And the current system is a good filter for this! Most accepted papers are well-written and the results always look good on paper.) However, reviewers never take the time to actually verify that the experiment code matches the ideas described in the paper, and that the results reproduce. Ask any CS/engineering PhD student how many papers (in top venues) they've seen with a critical implementation flaw that invalidates the results -- and you might begin to understand the problem.

At least in CS, the system can be fixed, but those in power are unable and unwilling to fix it. Authors don't want to be held accountable ("if we submit the code with the paper -- someone might find a critical bug and reject the paper!"), and reviewers are both unqualified (i.e. haven't written a line of code in 25 years) and unwilling to take on more responsibility ("I don't have the time to make sure their experiment code is fair!"). So we are left with an obviously broken system where junior PhD students review artifacts for "reproducibility" and this evaluation has no bearing whatsoever on whether a paper gets accepted. It's too easy to cook up positive results in almost any field (intentionally, or unintentionally), and we have a system with little accountability.

It's not "the best we have", it's "the best those in power will allow". Those in power do not want consequences for publishing bad research, and also don't want the reviewing load required to keep bad research out.

2 comments

This is much too negative. Peer review indeed misses issues with papers, but by-and-large catches the most glaring faults.

I don’t believe for one moment that the vast majority of papers in reputable conferences are wrong, if only for the simple reason that putting out incorrect research gives an easy layup for competing groups to write a follow-up paper that exposes the flaw.

It’s also a fallacy to state that papers aren’t reproducible without code. Yes code is important, but in most cases the core contribution of the research paper is not the code, but some set of ideas that together describe a novel way to approach the tackled problem.

I spent a chunk of my career working on productionizing code from ML/AI papers and huge part of them are outright not reproducible.

Mostly they lack critical information (missing chosen constants in equations, outright missing information on input preparation or chunks of "common knowledge algorithms"). Those that don't have measurements that outright didn't fit the reimplemented algorithms or only succeeded in their quality on the handpicked, massaged dataset of the author.

It's all worse than you can imagine.

That’s the difference between truly new approaches to modelling an existing problem, or coming up with a new problem. No set of a bit different results or missing exact hyperparameter settings really invalidates the value of the aforementioned research. If the math works, and is a nice new point of view, its good. It may not even help anyone with practical applications right now, but may inspire ideas further down the line that do make the work practicable, too.

In contrast, if the main value of a paper is a claim that they increase performance/accuracy in some task by x%, then its value can be completely dependent on whether it actually is reproduceable.

Sounds like you are complaining about the latter type of work?

> No set of a bit different results or missing exact hyperparameter settings really invalidates the value of the aforementioned research.

If this is the case, the paper should not include a performance evaluation at all. If the paper needs a performance evaluation to prove its worth, we have every right to question the way that evaluation was conducted.

I don't think theres much value in theoretical approaches that lack important derivation data either, so no need to try to split the papers like this. The academic CS publishing is flooded with bad quality papers in any case.
I spent 3 months implementing a paper once. Finally, I got to the point where I understood the paper probably better than the author. It was an extremely complicated paper (homomorphic encryption). At this point, I realized that it doesn't work. There was nothing about it that would ever work, and it wasn't for lack of understanding. I emailed the author asking to clarify some specific things in the paper, they never responded.

In theory, the paper could work, but it would be incredibly weak (the key turned out to be either 1 or 0 -- a single bit).

Do you have a link to the paper?
+1
Anecdotally it is not. Most papers in CS I have read have been bad and impossible to reproduce. Maybe I have been unlucky but my experience is sadly the same.
> by-and-large catches the most glaring faults.

I did not dispute that peer review acts as a filter. But reviewers are not reviewing the science, they are reviewing the paper. Authors are taking advantage of this distinction.

> if only for the simple reason that putting out incorrect research gives an easy layup for competing groups to write a follow-up paper that exposes the flaw.

You can’t make a career out of exposing flaws in existing research. Finding a flaw and showing that a paper from last year had had cooked results gets you nowhere. There’s nowhere to publish “but actually, this technique doesn’t seem to work” research. There’s no way for me to prove that the ideas will NEVER work —- only that their implementation doesn’t work as well as they claimed. Authors who claim that the value is in the ideas should stick to Twitter, where they can freely dump all of their ideas without any regard for whether they will work or not.

And if you come up with another way of solving the problem that actually works, it’s much harder to convince reviewers that the problem is interesting (because the broken paper already “solved” it!)

> in most cases the core contribution of the research paper is not the code, but some set of ideas that together describe a novel way to approach the tackled problem

And this novel approach is really only useful if it outperforms existing techniques. “We won’t share the code but our technique works really well we promise” is obviously not science. There is a flood of papers with plausible techniques that look reasonable on paper and have good results, but those results do not reproduce. It’s not really possible to prove the technique “wrong”, but the burden should be on the authors to provide proof that their technique works and on reviewers to verify it.

It’s absurd to me that mathematics proofs are usually checked during peer review, but in other fields we just take everyone at their word.

They aren’t necessarily wrong but most are nearly completely useless due to some heavily downplayed or completely omitted flaw that surfaces when you try to implement the idea in actual systems.

There is technically academic novelty so it’s not “wrong”. It’s just not valuable for the field or science in general.

I don't think anyone is saying it's not reproducible without code, it's just much more difficult for absolutely no reason. If I can run the code of a ML paper, I can quickly check if the examples were cherry-picked, swap in my own test or training set... The new technique or idea was still the main contribution, but I can test it immediately, apply it to new problems, optimise the performance to enable new use-cases...

It's like a chemistry paper for a new material (think the recent semiconductor thing) not including the amounts used and the way the glassware was set up. You can probably get it to work in a few attempts, but then the result doesn't have the same properties as described, so now you're not sure if your process was wrong or if their results were.

More code should be released, but code is dependent on the people or environment that run it. When I release buggy code I will almost always have to spend time supporting others in how to run it. This is not what you want to do in Proof of concept to prove an idea.

I am not published but I have implemented a number of papers to code, it works fine (hashing, protocols and search mostly). I have also used code dumps to test something directly. I think I spend less time on code dumps, and if I fail I give up easier. That is the danger you start blaming the tools instead of how good you have understood the ideas.

I agree with you that more code should be released.. It is not a solution for good science though.

Sharing the code may also share the incorrect implementation biases.

It's a bit like saying that to help reproduce the experiment, the experimental tools used to reach the conclusion should be shared too. But reproducing the experiment does not mean "having a different finger clicking on exactly the same button", it means "redoing the experiment from scratch, ideally with a _different experimental setup_ so that it mitigates the unknown systematic biases of the original setup".

I'm not saying that sharing code is always bad, you give examples of how it can be useful. But sharing code has pros and cons, and I'm surprised to see so often people not understanding that.

If they don't publish the experimental setup, another person could use the exact same setup anyway without knowing. Better to publish the details so people can actually think of independent ways to verify the result.
But they will not make the same mistakes. If you ask two persons to build a software, they can use the same logic and build the same algorithm, but what are the chances they will do exactly the same bugs.

Also, your argument seems to be "_maybe_ they will use the exact same setup". So it already looks better than the solution where you provide the code and they _will for sure_ use the exact same setup.

And "publish the details" corresponds to explain the logic, not share the exact implementation.

Also, I'm not saying that sharing the code is bad, but I'm saying that sharing the code is not the perfect solution and people who thinks not sharing the code is very bad are usually not understanding what are the danger of sharing the code.

Nobody said sharing the code "is the perfect solution". Just that sharing the code is way better and should be commonplace, if not required. Your argument that not doing so will force other teams to do re-write the code seems unrealistic to me. If anyone wants to check the implementation they can always disregard the shared code, but having it allows other, less time-intensive checks to still happen: like checking for cherry-picked data, as GP suggested, looking through the code for possible pitfalls etc. Besides, your argument could be extended to any specific data the paper presents: why publish numbers so people can get lazy and just trust them? Just publish the conclusion and let other teams figure out ways to prove/disprove it! - which is (more than) a bit ridiculous, wouldn't you say?
What you're deliberately ignoring is that omitting important information is material to a lot of papers because the methodology was massaged into desired results to created publishable content.

It's really strange seeing how many (academic) people will talk themselves into bizarre explanations for a simple phenomenon of widespread results hacking to generate required impact numbers. Occams razor and all that.

If it is massaged into desired results, then it will be invalidated by facts quite easily. Inversely, obfuscating things is also easy if you just provide the whole package and just say "see, you click on the button and you get the same result, you have proven that it is correct". No providing code means that people will redo their own implementation and come back to you when they will see they don't get the same results.

So, no, no need to invent that academics are all part of this strange crazy evil group. Academics are debating and are being skeptical of their colleagues results all the time, which is already contradictory to your idea that the majority is motivated by frauding.

Occams razor is simply that there are some good reasons why code is not shared, going from laziness to lack of expertise on code design to the fact that code sharing is just not that important (or sometimes plainly bad) for reproducibility, no need to invent that the main reason is fraud.

Ok, that's a bit naive now. The whole "replication crisis" is exactly the term for bad papers not being invalidated "easily". [1]

Beacuse - if you'd been in academia - you'd find out that replicating papers isn't something that will allow you to keep your funding, your job and your path to next title.

And I'm not sure why did you jump to "crazy evil group" - noone is evil, everyone is following their incentives and trying to keep their jobs and secure funding. The incentives are perverse. This willing blindness against perverse incentives (which appears both in US academia and corporate world) is a repeated source of confusion for me - is the idea that people aren't always perfectly honest when protecting their jobs, career success and reputation really so foreign to you?

[1]:https://en.wikipedia.org/wiki/Replication_crisis

> It's not "the best we have", it's "the best those in power will allow". Those in power do not want consequences for publishing bad research, and also don't want the reviewing load required to keep bad research out.

This is a very conspiratorial view of things. The simple and true answer is your last suggestion: doing a more thorough review takes more time than anyone has available.

Reviewers work for free. Applying the level of scrutiny you're requesting would require far more work than reviewers currently do, and maybe even something approaching the amount of work required to write the paper in the first place. The more work it takes to review an article, the less willing reviewers are to volunteer their time, and the harder it is for editors to find reviewers. The current level of scrutiny that papers get at the peer-review stage is a result of how much time reviewers can realistically volunteer.

Peer review is a very low standard. It's only an initial filter to remove the garbage and to bring papers up to some basic quality standard. The real test of a paper is whether it is cited and built upon by other scientists after publication. Many papers are published and then forgotten, or found to be flawed and not used any more.

> Reviewers work for free.

If journals were operating on a shoestring budget, I might be able to understand why academics are expected to do peer review for free. As it is, it makes no sense whatsoever. Elsevier pulls down huge amounts of money and still manages to command free labor.

I think it has to be this way, right? Otherwise a paid reviewer will have obvious biases from the company.
It seems to me that paying them for their time would remove bias, rather than add it.
How is that?
I guess the sensible response is "what bias does being paid by Elsevier add that working for free for Elsevier doesn't add?"

The external bias is clear to me (maybe a paper undermines something you're about to publish, for example) but I honestly can't see much additional bias in adding cash to a relationship that already exists.

>The real test of a paper is whether it is cited and built upon by other scientists after publication. Many papers are published and then forgotten, or found to be flawed and not used any more.

This does seem true, but this forgets the downstream effects of publishing flawed papers.

Future research in this area is stymied by reviewers who insist that the flawed research already solved the problem and/or undermines the novelty of somewhat similar solutions that actually work.

Reviewers will reject your work and insist that you include the flawed research in your own evaluations, even if you’ve already pointed out the flaws. Then, when you show that the flawed paper underperforms every other system, reviewers will reject your results and ask you why they differ from the flawed paper (no matter how clearly you explain the flaws) :/

Published papers are viewed as canon by reviewers, even if they don’t work at all. It’s very difficult to change this perception.

If you get such a simple-minded reviewer, you can push back in your response, or you can even contact the editor directly.

Reviewers are not all-powerful, and they don't all share the same outlook. After all, reviewers are just scientists who have published articles in the past. If you are publishing papers, you're also reviewing papers. When you review papers, will you assume that everything that has ever passed peer review is true? Obviously not.