Hacker News
by qingcharles 44 days ago
Yes, it is really hard to digitize a lot of these documents in a way that keeps all of the information where it belongs. Modern books are easy to scan because the text runs in neat blocks and the output is neat blocks. But some texts just don't want to be wrangled into the neat sentences and paragraphs we expect, and all the gloss gets lost.

That said, I've found the recent LLMs will happily accept an entire book of scanned pages (just the images) and summarize the complete contents in a single go, which is very useful for cataloging and indexing publications. For a project I'm doing, I have millions of documents in hundreds of languages where only images of the pages exist, so I'm trying to get a good idea of the contents. A user can then choose to open the document and read the full text in its original format and layout.