Hacker News
by btrettel 57 days ago
Have you all considered adding scientific articles to your bibliographic database? Finding existing translations of scientific articles can be a real pain. I know because I spent a lot of time doing that during my PhD [1].

For a while I was collaborating with Victor Venema in the volunteer organization Translate Science [2] to try to create a bibliographic database of scientific translations, but unfortunately Victor died, and I became too busy to continue.

[1] https://academia.stackexchange.com/a/93209/31143

[2] https://translate-science.codeberg.page/


Thanks for the link; Translate Science is exactly the kind of gap-filling project that makes sense once you see how fragmented the bibliographic layer is. Sorry to hear about Victor; I'd seen the repos but hadn't known.

Scientific translations are a different animal from what I've been working on, in ways that make them both easier and harder. Easier because scholarly communication already has a near-universal identifier (DOI) and, in principle, Crossref metadata. Harder because most translated articles never get their own DOI — they live as post-hoc PDFs on an author's site or inside an institutional repository (HAL, SciELO, J-STAGE, NII) with no machine-readable back-reference to the original, and the original's Crossref record almost never points at them. So the signal is worse than with books despite the underlying infrastructure being better.

The approach that might transfer: instead of trying to convince publishers or journals to register translations (they won't), scrape what's already sitting in institutional repositories and national scientific databases, then reconcile by author + title fingerprint + language. The multilingual matching pipeline I use for books is probably the right shape for the article problem too, though the authority side is messier there. ORCID helps; affiliations drift and make it harder.
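
To make the reconcile step concrete, here is a rough sketch, not the actual pipeline. It assumes each scraped repository record carries a hypothetical `original_title` field giving the title of the source article, which many repositories will not provide. The field names, the token-sort fingerprint, and the sample records are all made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^\w\s]", " ", no_marks.lower()).split())


def author_key(name: str) -> str:
    """Family name only; assumes 'Family, Given' or 'Given Family' order."""
    family = name.split(",")[0] if "," in name else name.split()[-1]
    return normalize(family)


def title_fingerprint(title: str) -> str:
    """Order-insensitive key: the sorted set of normalized title tokens."""
    return " ".join(sorted(set(normalize(title).split())))


def reconcile(originals, candidates):
    """Pair candidates with originals on (author key, title fingerprint),
    keeping only pairs whose languages differ."""
    index = defaultdict(list)
    for rec in originals:
        key = (author_key(rec["author"]), title_fingerprint(rec["title"]))
        index[key].append(rec)

    matches = []
    for cand in candidates:
        # `original_title` is a hypothetical field (see the note above).
        key = (author_key(cand["author"]),
               title_fingerprint(cand["original_title"]))
        for orig in index.get(key, []):
            if orig["language"] != cand["language"]:
                matches.append((orig["id"], cand["id"]))
    return matches


# Made-up records, for illustration only.
originals = [{"id": "orig-1", "author": "Müller, Hans", "language": "de",
              "title": "Über die Turbulenz in Flüssigkeiten"}]
candidates = [{"id": "repo-7", "author": "Hans Muller", "language": "en",
               "title": "On turbulence in liquids",
               "original_title": "Uber die Turbulenz in Flussigkeiten."}]
print(reconcile(originals, candidates))
```

Exact key matching like this is deliberately strict. Real records would probably need fuzzy title comparison and something better than a family name on the authority side, which is where ORCID would come in.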

Not something I'm committing to build, but I'd be curious to see what you and Victor had assembled if any of it is still reachable. Happy to compare notes offline if useful.

Thanks for the reply. You're right that the data for this is very fragmented. Victor was looking at Crossref metadata. I think he kept all of his work on Codeberg, though I'm not sure. I was looking at arXiv and at printed translation indices from the 1960s to the 1980s. These indices list paper translations that today sit uncatalogued in archives at the Library of Congress, the British Library, and other libraries and archives. (The indices list which libraries hold each translation, and in my experience those listings are accurate for the Library of Congress.) OCR was not cooperating in turning my scans of the indices into something I could parse, even though the indices have a regular structure that suggests they were computer-generated. LLMs would likely help with that now, but all of this was pre-ChatGPT. My plan was to automatically convert the bibliographic data in the indices to DOIs, but as it turned out, a large fraction of the articles in the indices do not have DOIs. We ultimately did not consolidate these sources.
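
For anyone wanting to try the citation-to-DOI step, here is a minimal sketch of one way it could look today, not what we actually ran. It uses the Crossref REST API's `query.bibliographic` parameter on the `/works` endpoint. The `min_score` cutoff of 60 is an arbitrary assumption that would need tuning, and the helper names are made up.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation: str, rows: int = 3) -> str:
    """Build a Crossref works query from a free-text citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def pick_doi(response: dict, min_score: float = 60.0) -> Optional[str]:
    """Return the DOI of the best-scoring item, or None if nothing clears
    the threshold. The threshold is an assumption, not a Crossref default."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    best = max(items, key=lambda item: item.get("score", 0.0))
    if best.get("score", 0.0) < min_score:
        return None
    return best.get("DOI")


def lookup_doi(citation: str, min_score: float = 60.0) -> Optional[str]:
    """Query Crossref over the network and return a DOI or None."""
    with urlopen(build_query_url(citation), timeout=30) as resp:
        return pick_doi(json.load(resp), min_score)
```

Returning `None` for weak matches matters here, since many of the indexed articles have no DOI at all and a forced best guess would just attach the wrong one.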

Anyhow, it's obviously a huge task and I don't expect you to build this. I was just curious if you had thought about it as you clearly have a lot of relevant infrastructure in place. If I ever get the time and interest to work on this again, I'll reach out to you.