| HN Mirror



	by amenod 1392 days ago
	That sounds doable, but... why open source? Does this mean that people / companies are not prepared to pay for the product or service?


chaps 1392 days ago

I have around 2 million pages from FOIA requests that need information systematically extracted, and I'm not alone in this problem. The cost of systematizing that many pages will be prohibitive.

The public good of having a resource like this available to the public for free is beyond unimaginable as far as I'm concerned.

Fizzz 1392 days ago

How do you currently extract info from the FOIA request pages? What kind of info do you look for? Just thinking about how you could standardise this.

chaps 1392 days ago

Mostly Tesseract, or uploading to DocumentCloud and running manual searches. I combine data analysis with many hours of reading through documents. Sometimes I use Unix tools like grep and awk, sometimes SQL. If the PDF isn't scanned, I use Tabula for CSV extraction, but if it's scanned it becomes a silly ordeal.

Mind you, I'm not exactly looking for advice here. It's a supremely difficult problem and gut-ideas more often than not don't pan out.
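
To illustrate the grep-style step described above, here is a minimal sketch, not the commenter's actual setup. It assumes the scanned pages have already been OCR'd into one plain-text file per page in a single directory. The names search_pages, Hit, and the directory layout are hypothetical, chosen only for this example.

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    page: str     # file name of the OCR'd page (hypothetical layout: one .txt per page)
    line_no: int  # 1-based line number within that page
    text: str     # the matching line, with surrounding whitespace stripped


def search_pages(text_dir: str, pattern: str) -> Iterator[Hit]:
    """Yield every line matching `pattern` (case-insensitive) in text_dir/*.txt."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Sort so results come back in a stable page order.
    for path in sorted(Path(text_dir).glob("*.txt")):
        # OCR output can contain bad bytes; replace them rather than crash.
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                if regex.search(line):
                    yield Hit(path.name, line_no, line.strip())
```

Something like `for hit in search_pages("ocr_out", r"invoice"): print(hit)` would then list the matching lines page by page. This covers only the search step; it does nothing about OCR quality or table extraction, which is where the comment says the real difficulty lies.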
