by amenod 1392 days ago
That sounds doable, but... why open source? Does this mean that people / companies are not prepared to pay for the product or service?

I have around 2 million pages from FOIA requests that need information systematically extracted, and I'm not alone in this problem. The cost of systematizing that many pages would be prohibitive.

The public good of having a resource like this freely available is almost unimaginable, as far as I'm concerned.

How do you currently extract info from the FOIA request pages? What kind of info do you look for? Just thinking about how you could standardise this.
Mostly Tesseract, or uploading documents through DocumentCloud and running manual searches. I combine data analysis with many hours of reading through documents. Sometimes I use Unix tools like grep and awk, sometimes SQL. If the PDF isn't scanned, I use Tabula for CSV extraction, but if it's scanned it becomes a silly ordeal.

Mind you, I'm not exactly looking for advice here. It's a supremely difficult problem, and gut ideas more often than not don't pan out.
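
For anyone trying to picture the scanned vs. non-scanned split described above, here is a rough sketch in Python. It is not the poster's actual setup. It assumes the embedded text of each page has already been pulled out by some PDF library, and the names looks_scanned, choose_tool, ocr_image and the MIN_WORDS threshold are all made up for illustration. The only external call is the tesseract command line with "stdout" as the output base, which prints the recognized text instead of writing a file.

```python
import re
import shutil
import subprocess
from pathlib import Path

# Hypothetical threshold: a page whose embedded text layer has fewer
# word-like tokens than this is treated as a scanned image.
MIN_WORDS = 20


def looks_scanned(embedded_text: str, min_words: int = MIN_WORDS) -> bool:
    """Guess whether a page is a scan, based on the text embedded in the PDF."""
    # Count runs of two or more letters/digits; stray single characters
    # are ignored because they are often noise in a broken text layer.
    words = re.findall(r"[A-Za-z0-9]{2,}", embedded_text)
    return len(words) < min_words


def choose_tool(embedded_text: str) -> str:
    """Return 'tesseract' for scanned pages and 'tabula' for pages with text."""
    return "tesseract" if looks_scanned(embedded_text) else "tabula"


def ocr_image(image_path: Path) -> str:
    """Run the tesseract CLI on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A page with an empty text layer gets routed to OCR, and a page with plenty of embedded words gets routed to table extraction. The word-count heuristic is a guess and will misfire on mostly blank pages or on scans that carry a poor OCR layer from a previous pass, which is part of why this kind of thing tends not to pan out at 2 million pages.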