HN Mirror



	by chaps 1392 days ago
	Mostly Tesseract, or uploading to DocumentCloud and running manual searches. I combine data analysis with many hours of reading through documents. Sometimes I use Unix tools like grep/awk/etc., sometimes SQL. If the PDF isn't scanned, I use Tabula for CSV extraction, but if it's scanned it becomes a silly ordeal. Mind you, I'm not exactly looking for advice here. It's a supremely difficult problem, and gut ideas more often than not don't pan out.
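
	For readers unfamiliar with the split described above (Tabula for PDFs with a text layer, Tesseract for scans), here is a rough sketch of how that branching might look. It is not the commenter's actual setup. The helper names `looks_scanned` and `plan_commands`, the 20-character threshold, the `tabula.jar` filename, and the use of Poppler's `pdftoppm` to rasterize a page before OCR are all assumptions made for illustration. The sketch only builds the command lines and does not run them, and it handles just the first page of a scan.

```python
from pathlib import Path


def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a rule): if the PDF's text layer yields
    almost no non-whitespace characters, treat the file as a scan."""
    return len("".join(extracted_text.split())) < min_chars


def plan_commands(pdf: str, extracted_text: str) -> list[list[str]]:
    """Return the command lines one might run for this PDF.

    `extracted_text` is whatever a text extractor (for example `pdftotext`)
    returned for the file; obtaining it is left to the caller.
    """
    stem = Path(pdf).stem
    if looks_scanned(extracted_text):
        return [
            # Rasterize page 1 at 300 dpi to <stem>.png (Poppler's pdftoppm).
            ["pdftoppm", "-r", "300", "-png", "-f", "1", "-l", "1",
             "-singlefile", pdf, stem],
            # OCR the image; Tesseract writes <stem>.txt.
            ["tesseract", f"{stem}.png", stem],
        ]
    # Text layer present: ask tabula-java for CSV tables from every page.
    # "tabula.jar" is a placeholder for wherever the jar actually lives.
    return [["java", "-jar", "tabula.jar", "-p", "all", "-f", "CSV",
             "-o", f"{stem}.csv", pdf]]


if __name__ == "__main__":
    for cmd in plan_commands("report.pdf", ""):
        print(" ".join(cmd))
```

	The resulting `.txt` or `.csv` files are what grep, awk, or SQL would then be pointed at. Real scans tend to need per-page handling and cleanup that this sketch ignores, which is presumably part of the "silly ordeal".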