| HN Mirror

	by ray991 2217 days ago
	I really wish the PDF layout was easier to parse. No matter which library you use, you always run into edge cases that make text selection and extraction a problem on certain files. I was recently extracting financial data from a bank that provides only PDFs, and every time they changed the format just a little bit, I had to change large parts of my code to extract the transactions I wanted.

3 comments

jfk13 2216 days ago

PDF is designed to present a human-readable document, not to serve as a data interchange format.


saradhi 2217 days ago

I agree with this; it's the same with insurance companies when resolving claims. It feels like they want to make extraction complicated for some unknown reason. It doesn't happen often, and not with all companies, but there are edge cases.


dmoo 2217 days ago

I’m sure you’ve looked at it, but I’ve had a lot of success with pdftotext -layout
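For anyone who wants to try this, here is a rough sketch of how pdftotext -layout could feed a transaction parser. It assumes Poppler's pdftotext is on the PATH, and the row shape (ISO date, description, amount, separated by runs of two or more spaces) is purely hypothetical; a real statement will need its own pattern. The names pdf_to_layout_text, parse_transactions and ROW are made up for illustration.

```python
import re
import subprocess
from decimal import Decimal

# Hypothetical row shape, e.g. "2019-03-14    COFFEE SHOP    -4.50".
# Columns are assumed to be separated by at least two spaces, which is
# roughly what -layout produces for tabular pages.
ROW = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2})\s{2,}(.+?)\s{2,}(-?[\d,]+\.\d{2})\s*$"
)


def pdf_to_layout_text(pdf_path):
    """Run pdftotext -layout and return the extracted text.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_transactions(layout_text):
    """Return (date, description, amount) tuples for lines matching ROW.

    Lines that do not look like a transaction (headers, page footers)
    are skipped rather than raising an error.
    """
    rows = []
    for line in layout_text.splitlines():
        match = ROW.match(line)
        if match:
            date, description, amount = match.groups()
            rows.append(
                (date, description.strip(), Decimal(amount.replace(",", "")))
            )
    return rows
```

Typical use would be parse_transactions(pdf_to_layout_text("statement.pdf")). Keeping the text extraction and the row pattern separate means that when the bank tweaks the layout, usually only ROW has to change.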
