| HN Mirror



	by btrettel 58 days ago
	Thanks for the reply. You're right that the data for this is very fragmented. Victor was looking at Crossref metadata. I think he always had what he was doing on Codeberg, though I'm not sure. I was looking at arXiv and at printed translation indices from the 1960s to 1980s, which list paper translations that today sit uncatalogued in archives at the Library of Congress, the British Library, and other libraries/archives. (The indices list which libraries hold each translation, and in my experience what they say is accurate for the Library of Congress.) OCR was not cooperating on turning my scans of the translation indices into something I could parse, even though the indices have a regular structure that indicates they were computer-generated. LLMs would likely help with that now, but all of this was pre-ChatGPT. My plan was to automatically convert the bibliographic data in the indices to DOIs, but as it turns out, a large fraction of the articles in the indices do not have DOIs. We ultimately did not consolidate these sources. Anyhow, it's obviously a huge task and I don't expect you to build this. I was just curious whether you had thought about it, as you clearly have a lot of relevant infrastructure in place. If I ever get the time and interest to work on this again, I'll reach out to you.