Hacker News
by wanderfowl 2516 days ago
I'd love to see a norm develop where the 'authoritative link' to an article is expected to be the most open. So, if there's a closed journal and an Arxiv pre-print, Arxiv gets the link, with the journal's publication status considered 'about the article', but not the thing itself.

I think it moves us towards a clearer understanding of academic journal publication as peer review's 'stamp of approval', rather than the explanatory event per se. And this will make it easier to move towards long-term, sustainable practices for publication and science.

5 comments

Journals used to have several important roles: curation of articles, maintaining a reputation of quality (peer review, etc), and the actual physical publication and distribution of the papers. Cheap personal computers capable of "desktop publishing" and the internet made publication and distribution really cheap and easy. Those tasks no longer require a lot of expensive specialized skills and expensive typesetting/printing tools. This means journals need to stop treating those tasks as if they were still a scarce resource, and rework their business model around the tasks people still value highly: high quality curation and a reliable and trustworthy reputation.

The actual hosting of the PDFs (and TeX, and hopefully even the raw data) is something that universities, or whoever the researchers are working for, could do cheaply and easily. When I was attending UC Davis in the late 90s, the university hosted a huge archive that not only included its own publications but also mirrored the publications of the other UCs and many important public archives like kernel.org.

Compared to huge archives of Linux distros and pre-Git source code histories, hosting a bunch of PDF/TeX is effectively free. Reliable curation that saves a lot of people from wasting their own time and effort trying to find useful/interesting papers is extremely valuable.

First we need a "verified" badge in biorxiv/arxiv to verify that the current preprint version is an exact copy of the one published in the journal. Then DOI could arrange to redirect to that copy instead.
That is not currently how the DOI infrastructure works at all.

Individual entities register DOIs, and decide where they redirect (and can change the resolution at any time). In these cases, the publisher (such as Elsevier) is the one who has registered the DOI, and they get to decide where it redirects/resolves. They also paid for the DOIs.

There are actually a (small-ish) number of DOI registrars. The largest, and most likely by far to be used for scholarly articles, is CrossRef.

Neither CrossRef nor the DOI foundation has the authority to change where a DOI resolves to, against the wishes of the DOI registrant. (It would be like a DNS registrar or the IANA deciding news.ycombinator.com should resolve somewhere other than Y Combinator wants it to. Indeed, DOI works pretty analogously to DNS, probably intentionally.)
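As a toy illustration of the registrant-controlled model described above: this is not how the real DOI infrastructure is implemented, and the class, method names, DOI, and URLs are all hypothetical. Resolution can be pictured as a lookup table that only the registrant may update:

```python
class ToyDoiRegistry:
    """Hypothetical in-memory model of registrant-controlled resolution."""

    def __init__(self):
        # Maps a DOI string to a (registrant, target URL) pair.
        self._records = {}

    def register(self, doi, registrant, url):
        if doi in self._records:
            raise ValueError(f"{doi} is already registered")
        self._records[doi] = (registrant, url)

    def update(self, doi, requester, new_url):
        registrant, _ = self._records[doi]
        # Only the entity that registered the DOI may change its target.
        if requester != registrant:
            raise PermissionError(f"only {registrant} may change {doi}")
        self._records[doi] = (registrant, new_url)

    def resolve(self, doi):
        return self._records[doi][1]


registry = ToyDoiRegistry()
registry.register("10.1234/example", "publisher", "https://publisher.example/article")

# A third party (here, the registrar) tries to point the DOI at a preprint.
try:
    registry.update("10.1234/example", "registrar", "https://preprints.example/article")
    registrar_changed_target = True
except PermissionError:
    registrar_changed_target = False
```

In this sketch the registrar's attempt is rejected and the DOI keeps resolving to the publisher's URL, which is the situation the proposal would have to change.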

What you propose would require major changes to the social and business setup of DOI. Probably to the business/sustainability model too, because a registrant would probably be less excited to pay for a DOI they don't actually get to control the resolution of. (CrossRef and the International DOI Foundation are both non-profits. They still need to pay for their operations, and the DOI infrastructure. That is currently funded by charging registrants for DOIs). It would also require some kind of "regulatory regime" to determine who has the authority on what basis to determine where a DOI resolves (and those 'regulators' would probably increase expenses, which you need a new plan for funding), compared to the current situation where whatever entity registered a DOI decides where it resolves to (similar to DNS).

You need neither. Simply hash both articles, and reference them by hash. Then you will automatically get the right paper, no matter the source (it could even be from a BitTorrent magnet link).
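A minimal sketch of what referencing by hash could look like, assuming SHA-256 as the hash function (the comment does not name one). The function names and the placeholder bytes are made up for illustration:

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a SHA-256 hex digest to use as a content-derived identifier."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_id: str) -> bool:
    """Check that bytes fetched from any source match the cited identifier."""
    return content_id(data) == expected_id


# Placeholder bytes standing in for a real paper file.
paper = b"placeholder bytes standing in for a real paper"
cited = content_id(paper)

from_mirror = paper              # an identical copy obtained from another host
tampered = paper + b" extra"     # a modified or simply wrong file

ok_mirror = verify(from_mirror, cited)
ok_tampered = verify(tampered, cited)
```

The identifier depends only on the bytes, so any host can serve the file and the reader can still check that it is the cited one. The flip side is that any byte-level difference produces a different identifier.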

DOIs are a horrible invention: they are prone to man-in-the-middle attacks and dead links. Please don't use them.

A slight impediment to that is that ArXiv discards PDFs that have not been accessed in a while, and rebuilds them from TeX source if later accessed. The result may not have the same hash. I sometimes even see ArXiv PDFs with today's date in them despite being published a long time ago, because the author used the \today macro. So you would need reproducible builds for the hashes to be valid, or for ArXiv to no longer have the storage concerns that lead them to this practice. Or you could hash the TeX, I suppose.
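To make the rebuild problem concrete, here is a rough sketch. `fake_build` is a hypothetical stand-in for a real TeX build that only substitutes the build date for the \today macro, and SHA-256 is an assumed choice of hash:

```python
import hashlib
from datetime import date


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def fake_build(tex_source: str, build_date: date) -> bytes:
    """Hypothetical stand-in for a TeX build: expands the today macro to the build date."""
    return tex_source.replace(r"\today", build_date.isoformat()).encode("utf-8")


tex = r"\title{A Paper}\date{\today}\begin{document}Hello\end{document}"
source_hash_before = sha256_hex(tex.encode("utf-8"))

# The same source built on two different days yields different output bytes.
build_2015 = fake_build(tex, date(2015, 3, 1))
build_2022 = fake_build(tex, date(2022, 3, 1))
output_hashes_match = sha256_hex(build_2015) == sha256_hex(build_2022)

# The source itself is untouched by rebuilding, so its hash is stable.
source_hash_after = sha256_hex(tex.encode("utf-8"))
source_hash_stable = source_hash_before == source_hash_after
```

Here the two builds hash differently while the source hash stays the same, which is why hashing the TeX sidesteps the need for reproducible builds.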
Yeah, you should hash the TeX. It's a pity really that PDF has become the dominant publication format; it's just so bad and not machine-readable. It's absurd to me that scientific publications haven't switched over to HTML. I mean, that format was invented for scientific publication...
I think the consistency of PDF is why it's favoured. It's also completely self contained where HTML has severe issues with long term access.
Could you elaborate on how you feel that HTML isn't self-contained, and how there are issues with long-term access?
IMO TeX isn't much more machine-readable, depending on what you want to do. Reformatting or lossy conversion to plaintext? Sure. Determining semantics? Good luck.
That sounds like an overkill thing to do nowadays
The journal version and the arxiv version will never hash to the same value because they are not bit-identical. But you want to link to the peer reviewed version, or one which is semantically identical to the peer reviewed version. So somebody needs to check that the arxiv version is semantically identical to the journal version.
You should hash the TeX, not the PDF. Alternatively you could have both documents PGP-signed by the author with a hash of the original TeX, if you want to make sure you get the right "semantically the same but different" version. But tbh that seems to be a slippery slope that I wouldn't want to go down: where do you draw the line for your semantic differences? Imagine you quote something which gets edited out. Suddenly it looks like you quote nonsense, while it's the original reference's fault.
There is no TeX source for the journal version. The point is that you don't want to trust the author to verify that the peer-reviewed+accepted version is the same as the arxiv version, and that it will not be changed. That's why people generally cite the journal version. Because it's immutable.
Journal versions are simply not immutable, because they are referenced by name, not by content. I regularly see a good percentage of dead or wrong DOIs, and I've hunted my fair share of papers that were supposedly released in a journal, but that only ever existed in preprint.

Arxiv already accepts LaTeX and compiles it for you; we should expect the same from journals and ask them to publish the hash of the document they received.

Why would the author not be trusted? What do they stand to gain? Arxiv can make the final version immutable too.
Also, in many cases there is a final round of modifications done by the publisher that you are not free to distribute. For journal papers, I was told that sometimes you cannot even publish the corrected version after rebuttal.
It's not the same file, just the same final text proof. It will be different from the final formatting in the journal.

I don't think authors have an incentive to abuse the system. Just upload the final proof of your manuscript to arxiv, click "final version", and this lets people know that this is the same article as in the journal.

DOIs are ubiquitous and they would serve the purpose of redirecting to the free PDFs rather than the journal site. This can be applied to existing articles retroactively. Plus, many bibliography styles include the DOI, which makes the reference easier to use.

DOIs are controlled by the journal publisher, so I don't see why they would be willing to change their target.
Yeah, good point. Maybe doi.org could resolve them differently? In any case it's the reference identifier that could connect the two documents.
Semantic Scholar[0] tends to do this, but their search functionality leaves something to be desired. I tend to use them to discover DOI addresses and find related media if I already know the paper's title (e.g. following up on a bibliography without links, as is the norm in many publications).

[0] https://www.semanticscholar.org

There's already such a norm. Use https://oadoi.org/ !
It would be great to see that on HN, so that non-paywalled articles were considered more authoritative than paywalled articles.
Most of the paywalled articles here come from newspapers, which pay their authors, unlike Elsevier.