| HN Mirror

This seems to have stalled but if popped up a few times on HN in the past. Might still be worth a look.

Are the documents scans, or do they have real text on them? It’s worth trying to convert them to svg or html using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time you’ll probably find the patterns in there are common enough that you can easily grab what you want.