|
|
|
|
|
by daniel-levin
2298 days ago
|
|
I’m a contractor. One of my gigs involved writing parsers for 20-something different kinds of pdf bank statements. It’s a dark art. Once you’ve done it 20 times it becomes a lot easier. Now we simply POST a pdf to my service and it gets parsed and the data it contains gets chucked into a database. You can go extremely far with naive parsers. That is, regex combined with positionally-aware fixed-length formatting rules. I’m available for hire re. structured extraction from PDFs. I’ve also got a few OCR tricks up my sleeve (eg for when OCR thinks 0 and 6 are the same) |
|
A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
As someone else said, this is like black magic, but kind of fun :)
Edit: grammar