by clickok 2107 days ago
It would be nice if citing repositories were easier, either for generating a reference for my own code or for acknowledging when I've used someone else's code in my research.

There are tons of math and physics blogs that contain useful results that the author wanted to make available but didn't manage to incorporate into a paper. I wonder if there'd be any interest in a sort of GitHub for proofs? It could even use git, since (assuming consistency) isn't math just a DAG anyways (and therefore isomorphic to a neural net, as are all things)?

3 comments

Traditionally that sort of stuff goes in tech reports, dissertations, or text books.

What's missing is the dissemination piece. Somehow people will absolutely refuse to take seriously the job of citing code they use, even when their main result is obtainable by "and then I ran something from scipy/numpy/pytorch/etc."

You can get a free DOI for and archive a tag of a Git repo with FigShare or Zenodo.

If you have repo2docker REES dependency files (requirements.txt, environment.yml, postBuild, etc.) in your repo, a BinderHub like https://mybinder.org can build and cache a container image and launch a (free) instance in a k8s cloud.
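For a GitHub-hosted repo, the mybinder.org launch link follows a `/v2/gh/<owner>/<repo>/<ref>` pattern, so it can be generated for a README badge. A minimal sketch in Python; `binder_launch_url` is just an illustrative helper (not part of any BinderHub client), and the owner, repo and tag are made-up placeholders:

```python
from urllib.parse import quote


def binder_launch_url(owner, repo, ref="HEAD", host="https://mybinder.org"):
    """Build a BinderHub launch URL for a GitHub repository.

    `ref` can be a branch, tag, or commit id; pinning a tag or commit
    keeps the launched environment tied to one version of the code.
    """
    # Percent-encode each path segment so a ref such as "feature/x"
    # does not get split into extra path components.
    parts = [quote(p, safe="") for p in (owner, repo, ref)]
    return f"{host}/v2/gh/{parts[0]}/{parts[1]}/{parts[2]}"


# Placeholder names, for illustration only.
print(binder_launch_url("example-user", "example-repo", "v1.0.0"))
```

Pointing the link at the same tag you archived for the DOI keeps the citation and the runnable environment in sync.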

Journals haven't yet integrated with BinderHub.

Putting the suggested citation and DOI URI/URL in your README, and cataloging citations in, e.g., a wiki page, may increase the crucial frequency of citation.
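As a rough sketch of what a "suggested citation" block for a README could look like, here is a small Python helper that renders a BibTeX `@misc` entry with a DOI link. Everything in the example call (key, author, title, DOI) is a placeholder I made up, and `readme_citation` is a hypothetical helper, not something FigShare or Zenodo provides:

```python
def readme_citation(key, authors, title, year, doi, version):
    """Render a BibTeX @misc entry suitable for pasting into a README."""
    author_field = " and ".join(authors)
    lines = [
        f"@misc{{{key},",
        f"  author = {{{author_field}}},",
        f"  title = {{{title}}},",
        f"  year = {{{year}}},",
        f"  version = {{{version}}},",
        f"  doi = {{{doi}}},",
        # https://doi.org/<doi> is the standard DOI resolver URL.
        f"  url = {{https://doi.org/{doi}}}",
        "}",
    ]
    return "\n".join(lines)


# Placeholder metadata, for illustration only.
print(readme_citation(
    key="example2020",
    authors=["Doe, Jane", "Roe, Richard"],
    title="example-repo: an example research code",
    year=2020,
    doi="10.5281/zenodo.0000000",
    version="1.0.0",
))
```

Giving readers a copy-pasteable entry removes one excuse for not citing the code.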

A Linked Data format for presenting well-formed arguments with #StructuredPremises would help to realize the potential of the web as a graph of resources which may satisfy formal inclusion criteria for #LinkedMetaAnalyses.

The issue is that none of the citation count engines (Google Scholar, Scopus, Web of Science...) count citations on those DOIs. So for a researcher who needs to somehow demonstrate impact through citation counts, it unfortunately does not really help.

We could reason about sites that index https://schema.org/ScholarlyArticle according to our own and others' observations. Google Scholar, Semantic Scholar, and Meta all index Scholarly Articles: they copy the bibliographic metadata and the abstract for archival and scholarly purposes.

AFAIU, e.g. Zotero and Mendeley do not crawl and index articles or attempt to parse bibliographic citations from the astounding plethora of citation styles [citationstyles, citationstyles_stylerepo] into a citation graph suitable for representative metrics [zenodo_newmetrics].

bitcoin.org/bitcoin.pdf does not have a DOI, does not have an ORCID [orcid], and is not published in any journal but is indexed by e.g. Google Scholar; though there are apparently multiple records referring to a ScholarlyArticle with the same name and author. Something like "Hell's Angels" (1930)? No DOI, no ORCID, no parseable PDF structure: not indexed.

AFAIU, Google Scholar does not yet index ScholarlyArticle (or SoftwareApplication < CreativeWork) bibliographic metadata. GScholar indexes an older set of bibliographic metadata from HTML <meta> tags and also attempts to parse PDFs. [gscholar_inclusion]
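To make the `<meta>` tag route concrete, here is a minimal Python sketch that emits the kind of tags the inclusion guidelines describe. The tag names (`citation_title`, `citation_author`, `citation_publication_date`, `citation_pdf_url`) are as I recall them from [gscholar_inclusion], so check them there; `scholar_meta_tags` and the sample values are purely illustrative:

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, pdf_url):
    """Return HTML <meta> tags carrying bibliographic metadata for a page."""
    tags = [("citation_title", title)]
    # One citation_author tag per author, in author order.
    tags += [("citation_author", author) for author in authors]
    tags.append(("citation_publication_date", publication_date))
    tags.append(("citation_pdf_url", pdf_url))
    # Escape attribute values so quotes and ampersands stay valid HTML.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Placeholder values, for illustration only.
print(scholar_meta_tags(
    title="An Example Paper",
    authors=["Doe, Jane", "Roe, Richard"],
    publication_date="2020/01/31",
    pdf_url="https://example.org/paper.pdf",
))
```

These go in the `<head>` of the article's landing page; whether a given page actually gets indexed is still up to the crawler.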

Google Scholar is also not (yet?) integrated with Google Dataset Search (which indexes https://schema.org/Dataset metadata).

FigShare DOIs and Zenodo DOIs are DataCite DOIs [figshare_howtocite, zenodo_principles]; which apparently aren't (yet?) all indexed by Google Scholar [rescience_gscholar].

IIUC, all papers uploaded to https://arxiv.org are indexed by Google Scholar. In order for arxiv-vanity.com [arxiv_vanity] to render a mobile-ready, font-resizeable HTML5 version of a paper uploaded to arXiv, the LaTeX source must be uploaded. arXiv hosts certain categories of ScholarlyArticles.

JOSS (Journal of Open Source Software) has managed to get articles indexed by Google Scholar [rescience_gscholar]. They publish their costs [joss_costs], including a $275 Crossref membership and $1/paper for DOIs:

> Assuming a publication rate of 200 papers per year this works out at ~$4.75 per paper
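A quick back-of-the-envelope check in Python, using only the two cost items mentioned in this comment (the full itemized list is on the [joss_costs] page, and I'm not reproducing it here):

```python
# Figures quoted in this comment; everything else JOSS pays for is left out.
crossref_membership_per_year = 275.0  # USD
doi_fee_per_paper = 1.0               # USD
papers_per_year = 200                 # publication rate assumed in the quote

# Per-paper share of the membership fee, plus the per-paper DOI fee.
per_paper = crossref_membership_per_year / papers_per_year + doi_fee_per_paper
print(f"Crossref + DOI cost per paper: ${per_paper:.3f}")

# Whatever is left of the quoted ~$4.75 must come from other line items.
quoted_per_paper = 4.75
other_costs_per_year = (quoted_per_paper - per_paper) * papers_per_year
print(f"Implied other costs per year: ${other_costs_per_year:.2f}")
```

So Crossref membership and DOIs account for about half of the quoted ~$4.75; the rest comes from other items on the costs page.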

[citationstyles]: https://citationstyles.org

[citationstyles_stylerepo]: https://github.com/citation-style-language/styles

[gscholar_inclusion]: https://scholar.google.com/intl/en/scholar/inclusion.html#in...

[figshare_howtocite]: https://knowledge.figshare.com/articles/item/how-to-share-ci...

[zenodo_principles]: https://about.zenodo.org/principles/

[zenodo_newmetrics]: https://www.frontiersin.org/articles/10.3389/frma.2017.00013...

[rescience_gscholar]: https://github.com/ReScience/ReScience/issues/38

[arxiv_vanity]: https://www.arxiv-vanity.com/

[joss_costs]: https://joss.theoj.org/about#costs

[orcid]: https://en.wikipedia.org/wiki/ORCID

Owing to the distributed nature of git, and the properties of the hashes it uses, it is probably enough to put a full commit id in a paper to securely reference a software project, regardless of its hosting platform.

We'd just need a dedicated search engine, and a way to automatically extract those from papers, to clone and archive repos.
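The extraction step could start out as simple as a regex over the paper's text. A minimal Python sketch, assuming papers cite full 40-hex-digit SHA-1 commit ids (`extract_commit_ids` is a hypothetical helper and the sample id is made up):

```python
import re

# Exactly 40 hex digits, not embedded in a longer hex run
# (so a 64-digit SHA-256 id is not mistaken for a SHA-1 id).
FULL_SHA1 = re.compile(r"(?<![0-9a-f])[0-9a-f]{40}(?![0-9a-f])")


def extract_commit_ids(text):
    """Return candidate full-length SHA-1 commit ids, in order, deduplicated."""
    seen = []
    for match in FULL_SHA1.finditer(text.lower()):
        commit_id = match.group(0)
        if commit_id not in seen:
            seen.append(commit_id)
    return seen


# Made-up id, for illustration only.
sample = (
    "All experiments were run at commit "
    "0123456789abcdef0123456789abcdef01234567 of our repository."
)
print(extract_commit_ids(sample))
```

Any 40-digit hex string matches, so these are only candidates: the archiver would still have to confirm that some repository actually contains a commit with that id.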

> the properties of the hashes [g]it uses

Git uses SHA-1 (a hardened implementation since 2017), and the project is now working on per-repo upgrades to SHA-256 [0]. Lots of repos are presumably still on SHA-1 (and lots of users on older versions of git).
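For anyone who hasn't looked at how the ids are derived: a git object id is the hash of a short header followed by the object's content. A minimal Python sketch for blobs (`git_blob_id` is my own helper name; the `algo="sha256"` option just swaps the hash function and assumes SHA-256 repos use the same header layout):

```python
import hashlib


def git_blob_id(content: bytes, algo: str = "sha1") -> str:
    """Compute the object id git assigns to a blob with this content."""
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.new(algo, header + content).hexdigest()


# The empty blob has a well-known SHA-1 id.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This should agree with `git hash-object` on ordinary files. As far as I understand it, the hardened implementation only behaves differently on inputs that look like known collision attacks.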

As of 2020, chosen-prefix attacks against SHA-1 are now practical. [verbatim from 1] But I don't think second preimage attacks are practical yet.

Linus Torvalds argued in 2006 basically that it's irrelevant whether git's hash function is second preimage resistant. Selective quoting:

> remember that the git model is that you should primarily trust only your _own_ repository [2]

> [a malicious] collision is entirely a non-issue: you'll get a "bad" repository that is different from what the attacker intended, but since you'll never actually use his colliding object, it's _literally_ no different from the attacker just not having found a collision at all [2]

All that is just to say: git originally chose its hashes for the above-mentioned "git model", and so didn't care 100% about second preimage resistance. For your suggested search engine, depending on how the database is collected, you might not be able to trust "your own repository" (if it's crowdsourced, I could register another codebase with the same hash as Linux). A second-preimage-resistant hash function would be a requirement for the suggested use case.

[0]: https://git-scm.com/docs/hash-function-transition/

[1]: https://en.wikipedia.org/wiki/SHA-1#cite_ref-8

[2]: https://marc.info/?l=git&m=115678778717621&w=2