HN Mirror



	by zeegroen 91 days ago
	Oh that's interesting! In my mind we are now on the cusp of being able to scan all these archives and have them be read by LLMs (in a first pass). Do you agree with that assessment, or am I being naive here?


giraffe_lady 91 days ago

I'm not in this field but I know someone who used to be and we've talked about it a fair bit. A quick overview of what's needed from what I understand:

Old books aren't that neat: you tend to have a lot of notes and other documents, translations, and scribal annotations from different eras interleaved or in the margins. You need to make decisions about that stuff as you go, which requires being informed about the context and meaning of those documents, which may well be in another language, or from hundreds of years before or after the document you're trying to process. For any given physical object it's quite likely that no single scholar has all the information necessary.

It is also extremely important to preserve all the context: things like which exact pages a fragment is stuck between, or even its orientation, can be critical information to later scholars. And in all of this you're handling ancient and precious one-of-a-kind paper documents. It's just slow going, and well beyond what I would even consider "skilled labor"; this very much is the work of research and scholarship. By the time you get a camera pointed at a page you're at the easy part.


cyocum 91 days ago

This is pretty true in general. Many have spent entire careers cataloguing manuscripts and their contents. The Royal Irish Academy did that in the early to mid part of the 20th century. The National Library of Scotland has also done theirs. It is painstaking and often unappreciated work.

As for imaging, there is Irish Scripts on Screen (https://www.isos.dias.ie/) which covers many different places and time periods.

Answering the grandparent comment, LLMs are not good at Old Irish. Seriously, they are awful at it. There is just too little data for it to work. I wrote a very little bit about text clustering in Old/Middle Irish (see https://doi.org/10.1515/9783110680744-005). I think the better place to start is transcription, and there are some tools out there which help, like Transkribus (https://www.transkribus.org/), which I haven't used, but it looks useful.

edit:typos


qingcharles 91 days ago

Yes, it is really hard to digitize a lot of these documents in a way that retains all of the information where it belongs. It's easy to scan modern books because the text runs in neat blocks and the output is neat blocks. But some texts just don't want to be wrangled into the neat sentences and paragraphs we expect, and forcing them into that shape loses all the glosses.

That said, I've found the recent LLMs will happily accept an entire book of scanned pages (just the images) and summarize the complete contents in one single go, which definitely has a very useful purpose in cataloging and indexing publications. For a project I'm doing I have millions of documents in hundreds of languages where only images of the pages exist, so I'm trying to get a good idea of the contents, then a user can choose to open the document and read the full text in its original format and layout.


dhosek 91 days ago

Indeed, and then there’s the fact that a single codex may contain multiple works, often unrelated (at least to modern eyes). Copying manuscripts was the old-school way of adding a book to one’s library, so an abbot in one monastery, learning that another monastery had works X, Y and Z, might request they be copied or send a monk to copy them. Even though X was a work of theology, Y a poem by Virgil and Z an account of the best way to raise green beans in poor soil, they’d end up together, possibly not even starting a new page when one work ends and the next begins.

And of course, title pages are a later invention, so the only way to know what’s in a manuscript is to actually read it.


IAmBroom 91 days ago

Even digitizing sources this old entails quite a lot of manual labor.
