by sam-2727 1644 days ago
I think this is a great idea theoretically, but in reality for most papers I don't want to see the data/underlying code. While it would be great to publish data/code with the paper (in the field I've worked on the most, astronomy, most data is already published with the paper anyways), I don't want/need to look through a notebook with the underlying code of the paper in order to just read the intro/conclusions (and maybe one key methods section). Interactive figures are a great idea, but again, oftentimes I don't really care to interact with the figure or fiddle around with it, I just want to know why the paper is important and how I should use its conclusions. The two-column format of most papers is very useful for skimming. So instead I would argue notebooks shouldn't replace papers, but supplement them (as they sometimes do already, in fact, but perhaps journals could make it an actual requirement to create a supplementary notebook).

As the article mentions, scientific fields are gigantic nowadays, and skimming papers is critical when you're citing 100+ references in your paper.

8 comments

EMBL-EBI and others had some RDF-related effort to provide machine readable abstracts, which I thought was a really cool idea.

IMHO, the biggest problem with papers is politics and reviews. In many top journals like Nature there's no double-blind review (actually in Nature it's now optional but big groups never use it). And even if there was double-blind review, referees have no skin in the game. So the usual outcome is to get reviewed by a big name in your field, who is actually interested in controlling research trends and killing "competitors".

This is hindering progress and hurting new ideas. For example, proponents of Alzheimer's disease being caused by an infection or dysbiosis have had a hard time doing research, getting grants and publishing articles during the last 2 decades, even though their theory is able to explain the etiology quite well, unlike competing alternatives.

Another problem is that to publish in good journals you need cool results. Cool results are rare, but Nature, Science, Cell et al. are full of articles every month. So, most groups are overselling and misreporting things. Research fraud, p-value hacking and data manipulation are really common.

> referees have no skin in the game

That's a big problem. I just got a paper rejected. Reviewer 1 fixated on a single detail I mentioned somewhere, not central to the paper at all, yet based most of the criticism on it. Reviewer 2 had difficulty understanding a table containing 2 columns and 3 rows, and what N, V, ADJ and ADV mean in a paper about dictionaries (not to mention the same abbreviations were used just before, and spelled out in plain words numerous times in multiple paragraphs). Reviewer 3 is the only one saying a remotely nice thing and who seems to have grasped what the paper is about. There is of course some valid criticism raised in the reviews, but half of it is bullshit that would be dispelled in a more interactive process and/or if reviewers had incentives to actually put in a minimal effort to understand the paper.

It’s not really possible to conduct double-blind reviews in most cases: authors or at least the group can often be easily guessed from the list of references, “in our previous work…”, and research domain and approach in general.
Sensible anonymisation policies prevent people from referring to "our previous work" in submissions - e.g., the policy for CHI [1] states:

> We do expect that authors leave citations to their previous work unanonymized so that reviewers can ensure that all previous research has been taken into account by the authors. However, authors are required to cite their own work in the third person, e.g., avoid “As described in our previous work [10], … ” and use instead “As described by [10], …”

However, it is true that things like choice of research questions, approach, and equipment used can be quite suggestive of the authors' identity.

[1]: https://chi2020.acm.org/authors/papers/chi-anonymisation-pol...

You should always want to have the underlying code available. Without the exact procedures they used to process their data, the only kind of "using their conclusions" you can do is the superficial "take it at face value" kind. So many important details get hand-waved away in papers that say things like "we used the well known blahblahblah method to analyze the data."

If you do it right, the code should in no way interfere with your ability to read abstracts.

I think I can convince you otherwise.

If I publish a paper saying I have an algorithm which can factor large composites, and in the paper publish the factors to all of the RSA numbers listed at https://en.wikipedia.org/wiki/RSA_Factoring_Challenge , then I think people will take it seriously, and not consider it at the superficial level.

Even if I don't publish the algorithm. ("Because of the security implications of this work, I have decided to withhold publication for a year.")

Furthermore, some things are worth publishing even if the method was "it came to me in a dream" à la Kekulé's snake. If you can demonstrate a sorting network of size 47 for n=14 inputs (which is the known lower bound) then you can publish that exemplar, even without publishing the method used to generate it.

(If you used computer assistance then that method would likely also be publishable, but that's a different point. Newton famously used the calculus to solve problems, but published their proofs using more traditional approaches.)

If you can come up with a protein model that is a significantly better fit to the X-ray diffraction data, then that's publishable too, no matter how you came up with that model.

In all of these cases, there are ways to verify the validity of the results without reproducing the methods used to come up with the result.
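
For the sorting network case, such a check can be sketched in a few lines of Python using the zero-one principle (a comparator network sorts every input if and only if it sorts every input made of 0s and 1s). The function name and the small 4-input network below are illustrative assumptions, not the hypothetical n=14 result:

```python
from itertools import product

def is_sorting_network(n, comparators):
    """Zero-one principle: check all 2**n inputs of 0s and 1s."""
    for bits in product((0, 1), repeat=n):
        wires = list(bits)
        # Each comparator (i, j) with i < j leaves the min on wire i, the max on wire j.
        for i, j in comparators:
            if wires[i] > wires[j]:
                wires[i], wires[j] = wires[j], wires[i]
        if wires != sorted(wires):
            return False
    return True

# The standard 5-comparator network for 4 inputs (illustration only).
network4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
print(is_sorting_network(4, network4))       # True
print(is_sorting_network(4, network4[:-1]))  # False: the last comparator is needed
```

For n=14 that is only 2**14 = 16384 inputs, so anyone could run the check without knowing how the network was found.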

This won’t work for empirical research. I vividly recall weeks spent trying to reproduce a paper on information retrieval (a deep learning model). What saved me is skimming through the author’s codebase and chancing upon an undocumented sampling step. They were only using the first and last passage in a document as training data and uniformly sampling from 10% of the remaining passages, and the paper didn't mention this. I adopted their sampling strategy, and I was able to obtain their results.

My argument is that there are nuances and subtleties that are often omitted in a paper (accidentally or otherwise), but are nevertheless required to reproduce the research.
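
A minimal sketch of that kind of sampling step, assuming it means keeping the first and last passage plus a uniform 10% sample of the passages in between. The function name, `fraction`, and `seed` are hypothetical and not taken from the authors' codebase:

```python
import random

def sample_passages(passages, fraction=0.1, seed=0):
    """Keep the first and last passage, plus a uniform sample of the middle ones.

    Hypothetical helper: one possible reading of the undocumented step,
    not the original authors' code.
    """
    if len(passages) <= 2:
        return list(passages)
    middle = passages[1:-1]
    k = max(1, round(fraction * len(middle)))  # how many middle passages to keep
    rng = random.Random(seed)                  # fixed seed so the sample is repeatable
    picked = sorted(rng.sample(range(len(middle)), k))
    return [passages[0]] + [middle[i] for i in picked] + [passages[-1]]
```

A choice like `fraction` or the seed is exactly the sort of detail that changes the training set yet rarely makes it into the paper.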

My example of a protein model is an example of empirical research, yes?

My understanding is the X-ray gives you a diffraction pattern which is hard to invert to a structure, while if you have the structure the diffraction pattern is easy to compute. The diffraction pattern therefore gives you a way to verify that one model is a better fit than another model.

It may not be perfect, certainly not. It might not even be correct once more data arrives. But if you predict a novel fold, and that fold matches the diffraction pattern significantly better than the current model, then it doesn't matter how you came up with the new fold, does it?

It could have been a dream. It could have been search software. The result is still publishable.

All of what you have said is true, but my point is for some research being able to verify the correctness of the result is all that matters, not being able to reproduce the research.

Can you reproduce Kekulé's dream?

What do you see as the fundamental point of scientific communication? In your counterpoints you narrow in on papers being a means of communicating concepts or proof of work. In this view, showing the process itself is pointless or at least irrelevant to the main axiom.

However, others (myself included) see the communication of methods as a primary function of the literature, because this is what enables others to understand, critique, and build upon the idea.

There is no single fundamental point.

If you want to be that broad about it, science journals publish a lot more than just method development, including obituaries and opinion pieces on where funding should be directed.

Here's a famous paper showing that "Euler's conjecture on sums of like powers" is incorrect - https://www.ams.org/journals/bull/1966-72-06/S0002-9904-1966... . I will repeat the body in full:

> A direct search on the CDC 6600 yielded 27⁵ + 84⁵ + 110⁵ + 133⁵ = 144⁵ as the smallest instance in which four fifth powers sum to a fifth power. This is a counterexample to a conjecture by Euler [1] that at least n nth powers are required to sum to an nth power, n>2.

Do I need to know how the direct search was carried out to confirm Euler's conjecture was false?

No.

  >>> 27**5 + 84**5 + 110**5 + 133**5 == 144**5
  True
And now that you know it isn't true, you might adjust which project areas to spend your time on. Which is part of what we get from scientific publications.

Just because you prefer one sort of scientific research doesn't mean other forms aren't science.

Again, is Kekulé's model of the benzene ring less scientific because it came to him in a daydream?

We accept Newton's publications where he secretly used the calculus, even though he didn't publish the calculus, because they could be proved through other more laborious means.

Why is it not scientific to write publications which use secret software, so long as we can verify the results?

Yes, there are occasional exceptions where you don't have to repeat or replicate the experiments reported in a paper to verify them. But that is very much the exception.

Generally you are expected to explain what you did in enough detail that the reader can replicate your experiment. If you're fitting a protein model to X-ray diffraction data, you aren't expected to include all the other protein models you considered that didn't fit, or explain to the reader your procedure for generating protein models, but you are expected to explain how you measured the fit to the X-ray diffraction data (with what algorithms or software, etc.) so that the reader can in theory do the same thing themself.

Sure, but "I found the structure after 5 months playing around with it in Foldit" isn't that reproducible or informative either.

The result is still the same - a novel fold which is a significantly better fit than existing models, based on measured vs. predicted x-ray diffraction patterns and whatever other data you might have.

Which is publishable, yes?

When the Wikipedia entry at https://en.wikipedia.org/wiki/Foldit says "Foldit players reengineered the enzyme by adding 13 amino acids, increasing its activity by more than 18 times", how is that much different than "A magical wizard added 13 amino acids, increasing its activity by more than 18 times"?

Or "secret software".

What's publishable is that the result is novel (and hopefully interesting), and can be verified. The publication does not require that all steps can be repeated.

I agree!

Unfortunately we have a long way to go to make it easy to repeat the calculation that a novel structure is "a significantly better fit than existing models, based on measured vs. predicted x-ray diffraction patterns". (If I run STEREOPOLE and it says the diffraction pattern from your new structure is a worse fit, is that because I'm running a different version of IDL? Maybe there's a bug in my FPU? Or the version of BLAS my copy of IDL is linked with? Or you're using a copy of STEREOPOLE that a previous grad student fixed a bug in, while my copy still has the bug? And stochastic software like GAtor is potentially even worse.)

This is something we could and should completely automate. There's been work on this by people like Konrad Hinsen, Yihui Xie, Jeremiah Orians, Eelco Dolstra, Ludovic Courtès, Shriram Krishnamurthi, Ricardo Wurmus, and Sam Tobin-Hochstadt, but there's a long way to go.

>Yes, there are occasional exceptions where you don't have to repeat or replicate the experiments reported in a paper to verify them. But that is very much the exception.

And even in this exceptional case, the algorithm itself is interesting above and beyond the fact of its existence.

It is, but if the algorithm produces a result such as a protein structure or a sorting network that is itself novel and verifiable, you can very reasonably publish that result separately. As long as it doesn't require knowing the search algorithm to replicate your result that the sorting network sorts correctly, which it wouldn't.
> If I publish a paper saying I have an algorithm which can factor large composites, and in the paper publish the factors to all of the RSA numbers

If some factors of those numbers are also large composites, without access to a good algorithm, nobody can truly verify your claims.

If not and you include all of those factors in an easily digestible way for computers to process (let's call that "code"), it will be easy for anyone to reproduce your results (run that code which multiplies all the factors and gets the resulting RSA numbers).

With code, they could easily check that there's not an error in your verification method too (e.g. broken large-number multiplication).

This would achieve both goals: you'd withhold your algorithm for security reasons, and your results would be easier to verify.

Edit: but to be honest, I think withholding the research is a bit of a special case. You are doing it on purpose, and you can easily offer a service to prove your algorithm works (e.g. imagine a "factoring" web service that instantly gives you a hash of the resulting sequence of factors, and then only mails you the actual sequence in two days).
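
A rough sketch of that commit-then-reveal idea, assuming SHA-256 as the hash and a random nonce mixed in, as is usual for commitments. `commit` and `check_reveal` are hypothetical names and the numbers are toy values:

```python
import hashlib
import secrets

def _payload(nonce, factors):
    # Canonical text form: the nonce, then the factors in sorted order.
    return (nonce + ":" + ",".join(str(p) for p in sorted(factors))).encode()

def commit(factors):
    """Return (digest, nonce). The digest is published immediately;
    the nonce and factors are only sent out later."""
    nonce = secrets.token_hex(16)
    return hashlib.sha256(_payload(nonce, factors)).hexdigest(), nonce

def check_reveal(digest, nonce, factors, n):
    """After the reveal: factors must match the commitment and multiply back to n."""
    product = 1
    for p in factors:
        product *= p
    return hashlib.sha256(_payload(nonce, factors)).hexdigest() == digest and product == n

digest, nonce = commit([101, 103])                      # service answers at once with a hash
print(check_reveal(digest, nonce, [101, 103], 10403))   # True, two days later
```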

It's a lot easier to check if something is a prime than to factor it.
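
A minimal sketch of that asymmetry, assuming a Miller-Rabin probabilistic primality test (so "prime" here means "prime with overwhelming probability"). The function names are illustrative and the example numbers are toy values, not RSA challenge numbers:

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin test: fast even for numbers with hundreds of digits."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def verify_factorization(n, factors):
    """True if the factors multiply to n and each one is (probably) prime."""
    product = 1
    for p in factors:
        product *= p
    return product == n and all(is_probable_prime(p) for p in factors)

print(verify_factorization(10403, [101, 103]))  # True
```

Neither step requires knowing how the factors were found.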
The point of interactive notebooks is not seeing and having access to all the data - it's seeing the abstractions at work, having a direct grasp of how they act on particular examples as an aid to understand their formal definition.

Nothing prevents you from having two-column notebooks, if you find that advantageous, as well as abstract and conclusions sections. The part that you don't get with a static paper is that of navigating the abstraction ladder[1] up and down with direct manipulation aids, instead of having to work it all in your head or by following dense detailed paragraphs.

[1] As also explained by Bret Victor in http://worrydream.com/LadderOfAbstraction/

I ran an experiment with my last paper: write everything from scratch in Jupyter Notebook, including data preprocessing and generation of all figures etc. (10 notebooks in total). Conceptualization started in 2017, and we just submitted it 2 weeks ago (it got desk rejected for not fitting the journal's topic).

I learned a lot and it was definitely worth it. The next paper will be easier with this knowledge. Nonetheless, there is an overhead and I feel that this overhead is not valued with the current makeup of journals, where you really need to dig deep to find any supplementary materials.

I did [something similar] too when I started my PhD ... I had one Makefile-managed project that ran everything with dependencies. From raw data, to figures and even embedding the numbers into the final, LaTeX-based PDF.

My supervisor manually copied all of the text from my PDF into a Word document on his first revision ...

Depending on the stuff you do, emacs org-mode is worth a shot. I write all my papers in org.
I think the ability to focus on the parts of the paper you care about most is what would benefit all readers. You care more about an overview? You can easily find it (perhaps with graphics and walkthroughs). You care more about proofs? Then you can get them. What about code and experiments? And so on and so forth.

Readability and scalability are about making all this data available in the publication record, but easy to navigate for whoever is looking for whatever.

Moving the burden of assessing everything in a paper completely to the reader is an interesting idea but seems somewhat of a step back when at the same time good and curated data gets ever more expensive. So the market for validated results is already not bad where those results "matter".

And not every paper has a lot of code or data associated with it. If you do experiments on organisms etc. then there is so much happening in the actual lab work - where would that go? Endless hours of video documentation?

In an IPython notebook you can fold away "blocks" of code. That means you can have everything there that produces the graphs and still be able to look under the hood if you like.

Isn't that the practical part about digital technology? That you are not limited to one view?

I don't know about this... considering how well some of you guys write papers, I'd rather look at the results and the code than read your paper.