by ray991 2217 days ago
I really wish the PDF layout was easier to parse. No matter which library you use, you always run into edge cases that make text selection and extraction a problem on certain files. I was recently extracting financial data from a bank that provides only PDFs, and every time they changed the format even slightly I had to rewrite large parts of my code to extract the transactions I wanted.
3 comments

PDF is designed to present a human-readable document, not to serve as a data interchange format.
I agree with this. It's the same with insurance companies when resolving claims. It feels like they want to make extraction complicated for some unknown reason. Not often, and not all companies, but there are edge cases.
I’m sure you’ve looked at it, but I’ve had a lot of success with pdftotext -layout.
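For anyone who wants to try this, here is a rough Python sketch of the approach: run pdftotext -layout (passing "-" as the output file sends the text to stdout) and then split each row on runs of two or more spaces, which is usually how column gaps show up in layout mode. The row shape (ISO date, description, amount), the sample statement, and the function names are all made up for illustration; a real bank statement will need its own pattern.

```python
import re
import subprocess
from decimal import Decimal

# Hypothetical row shape: ISO date, description, amount,
# with columns separated by runs of two or more spaces.
ROW = re.compile(r"^\s*(\d{4}-\d{2}-\d{2})\s{2,}(.+?)\s{2,}(-?[\d,]+\.\d{2})\s*$")


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the extracted text.

    Requires the pdftotext binary to be installed and on PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_transactions(text):
    """Return (date, description, amount) tuples for lines that look like rows."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            date, description, amount = match.groups()
            # Strip thousands separators before converting to Decimal.
            rows.append((date, description.strip(), Decimal(amount.replace(",", ""))))
    return rows


# Invented sample of what layout-preserved output might look like.
sample = """\
  Statement of account

  Date          Description                   Amount
  2019-03-01    Coffee Shop                    -4.50
  2019-03-02    Salary ACME Corp            2,500.00
"""
transactions = parse_transactions(sample)
```

With a real file you would call parse_transactions(pdf_to_layout_text("statement.pdf")). Keeping the pattern in one place at least means a small format change only touches the regex, though it will not save you from a full redesign of the statement.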