	by giovannibonetti 1230 days ago
	Since you are working with raw text, it shouldn't take much effort. There are a bunch of open source tools to extract text from PDFs. The hard part is parsing tables and other layout-dependent semantics. You usually start with text coordinates (like HTML elements with absolute positioning) and have to work backwards from those to recover rows and columns. I worked for some years on a project for a client that was full of edge cases, because whenever the input PDF (from a government agency) had a slight layout change, the parser would break. It took multiple iterations to make it robust enough.
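
To give a rough idea of what "working backwards from coordinates" can look like, here is a minimal sketch in Python. It is not the parser from that project. It assumes some extraction tool has already given you words with an x/y position, and the names `Word`, `group_rows`, `to_table` and the tolerance values are all made up for illustration. Words are clustered into rows by similar y, and each word is assigned to a column based on where the header words start.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A piece of text with its position (hypothetical extractor output)."""
    x: float  # left edge of the word
    y: float  # vertical position of the word
    text: str


def group_rows(words, y_tol=2.0):
    """Cluster words into rows: a word joins the current row if its y is
    within y_tol of that row's first word. y_tol is an assumed tolerance."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(w.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Within each row, order words left to right.
    return [sorted(r, key=lambda w: w.x) for r in rows]


def to_table(words, y_tol=2.0, x_tol=5.0):
    """Treat the first row as the header and use the x position of each
    header word as a column start. Every word goes to the right-most
    column that starts at or before it (with x_tol of slack)."""
    rows = group_rows(words, y_tol)
    starts = [w.x for w in rows[0]]
    table = []
    for row in rows:
        cells = [[] for _ in starts]
        for w in row:
            idx = max(
                (i for i, s in enumerate(starts) if s <= w.x + x_tol),
                default=0,
            )
            cells[idx].append(w.text)
        table.append([" ".join(c) for c in cells])
    return table


# Made-up sample data: positions are slightly jittered, as they often are.
words = [
    Word(10, 100, "Name"), Word(120, 100.5, "Amount"),
    Word(10, 115, "Alice"), Word(40, 115.3, "Smith"), Word(118, 114.8, "42.00"),
    Word(10, 130, "Bob"), Word(121, 130, "7.50"),
]

for row in to_table(words):
    print(row)
```

The fragility is easy to see here: if a new version of the document shifts the "Amount" column left by more than `x_tol`, or a header spans two words, the column assignment silently goes wrong. That is the kind of edge case that takes several iterations to handle.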