I have around 2 million pages from FOIA requests that need information systematically extracted, and I'm not alone in this problem. Processing that many pages will be prohibitively expensive.
The public good of having a resource like this available to the public for free is beyond unimaginable as far as I'm concerned.
Mostly Tesseract, or uploading the documents to DocumentCloud and running manual searches. I combine data analysis with many hours of reading through documents. Sometimes I use Unix tools like grep and awk, sometimes SQL. If the PDF isn't scanned, I use Tabula for CSV extraction, but if it's scanned it becomes a silly ordeal.
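For a rough idea of what the grep-style pass looks like once pages have been OCR'd, here is a minimal Python sketch. It assumes each page already exists as a plain `.txt` file somewhere under one directory; that layout, the function name `search_pages`, and the example pattern are all hypothetical, not a description of any particular setup.

```python
import re
from pathlib import Path


def search_pages(root, pattern, encoding="utf-8"):
    """Yield (file name, line number, line) for each line matching pattern.

    Assumes each page was already OCR'd to a plain .txt file under root;
    that directory layout is an assumption made for this sketch.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going when OCR output has bad bytes
        with path.open(encoding=encoding, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path.name, number, line.rstrip("\n")
```

Something like `list(search_pages("ocr_output", r"contract no\. \d+"))` would then give the hits as tuples that can be dumped to CSV or loaded into SQL. It does nothing about OCR quality, which is where most of the pain is.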
Mind you, I'm not exactly looking for advice here. It's a supremely difficult problem and gut-ideas more often than not don't pan out.