Hacker News new | ask | show | jobs
by aggerdom 2301 days ago
Are you me? Wish that I had known the insertion order trick, though it isn't straightforward to implement with the stack I was using at a previous gig. (Tabula + Naive parsing + Pandas Data Munging). I can expand on a few issues challenges I've run into when parsing PDFs:

# Parser drift and maintenance hell

Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections their declared in, and possibly making some kind of classification to make sure everything is adding up right. It works for the example or two pdfs you were building against. It goes live.

You get the a call or bug report: it's not working. You try the new pdf they send you. It looks similar, but won't parse because it is--in fact--subtly different. It has a slightly different formatting of the phone-number on the cover page, but identical everywhere else. You change things to account for that. You retest your examples, they break. Ok, two different formats same month, same supplier. You fix it. Chekhov's Gun has been planted.

A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.

A few more months later, a sense of deja-vu starts to set it. Didn't I fix this already? You start tracking three pdfs across 3 months:

pdf 1 : a -> b -> c (Starts with format a, change to be same as pdf 2, then changes again)

pdf 2 : b -> b -> c (Starts with one format, stays the same, changes the same way as pdf 1)

pdf 3 : b -> a -> b (Starts same as pdf 2, changes to same as pdf 1 first month, same as pdf 3)

What's the common factor between these version changes? The return address is determining the version.

PDFs are slightly different from office to office, with templates drifting slightly each month in diverging directions. You have to start reevaluating parsing choices and splitting up parsers. It's difficult to account for incurring linear maintenance cost for each new supplier and amortize that over a sizeable period of time. My arch nemesis is an intern who got put to work fixing the invoices at one office of one foreign supplier.

# PDFs that aren't standards compliant

In this case, most pdf processing libraries will bail out. Pdf viewers on the other hand will silently ignore some corrupted or malformed data. I remember seeing one that would be consistently off by a single bit. Something like `\setfont !2` needed to have '!' swapped out for another syntactically valid character that would leave byte offsets for the pdf unchanged.

TLDR: If you can push back, push back. Take your data in any format other than PDF if there is any way that is possible.