by sourc3 3963 days ago
I have been working on a side project that needs to read dynamic table layouts and extract financial information. I was excited to hear about Tabula a few weeks ago, but I had no success getting even one PDF extracted.

I ended up using the pdfquery package in Python, which relies heavily on PDFMiner under the covers.

Besides ABBYY's software (which is proprietary and licensed), does anyone have other recommendations?

4 comments

Shameless plug: https://pdftables.com
I have tried this and it was very useful as well.
I can't help but say I refuse to work with PDF files. I will email and do a ton of meetings and one-on-ones to explain that PDF is a container and that the format inside the container is the battle. Just give me the plain format; even if it costs the company money, it is worth it.
Much of the use of these tools is to extract data from government or corporate sources that, while required to publish the information, may not want to make it easy to access. Thus they prefer PDFs.

Those of us trying to extract the data bound up in these PDFs do advocate for access to the original data, but we have to deal with what we have today.

And this is not good for anyone and is the opposite of the spirit behind the Sunshine Laws.

My school district (what a mess) publishes images (horribly bad images) of all the school notes, including all financial information and spreadsheets. One night I had to spend 4 hours manually typing in the year's budget just to check on our spending per student. It was $5,400, the lowest in our state.

It's nice that you can opt out of working with PDFs but that's not an option for a large portion of the world.
I usually use "pdftotext -layout" and write Python or Perl code to handle the table extraction.
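
A minimal sketch of that approach in Python, not the commenter's actual code. It assumes poppler's pdftotext is on the PATH and that columns in the layout output are separated by runs of two or more spaces, which will not hold for every PDF. The function names and the min_gap and min_cols parameters are made up for illustration.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` on a PDF; the trailing "-" sends text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def split_row(line, min_gap=2):
    """Split one layout-preserved line into cells on runs of >= min_gap spaces.

    Assumption: a single space is part of a cell, a wider gap separates columns.
    """
    return [cell for cell in re.split(r" {%d,}" % min_gap, line.strip()) if cell]


def extract_rows(text, min_cols=2):
    """Keep only lines that split into at least min_cols cells (likely table rows)."""
    rows = (split_row(line) for line in text.splitlines())
    return [row for row in rows if len(row) >= min_cols]
```

Usage would be along the lines of extract_rows(pdf_to_layout_text("report.pdf")), where report.pdf is a placeholder filename. Rows with empty cells will come out misaligned with this heuristic, so real tables usually need per-document tweaks.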

If I need more detailed formatting information, I use "pdftohtml -xml -fullfontname" and process the resulting XML.
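
A rough sketch of what processing that XML could look like, again an illustration and not the commenter's code. It assumes the usual pdftohtml -xml output shape, where each page element holds text elements carrying top and left attributes, and it groups fragments into rows when their top values are within a small tolerance. The function name and the tolerance parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_from_pdf2xml(xml_text, tolerance=2):
    """Group <text> fragments into rows by vertical position, page by page.

    Assumption: fragments whose `top` lies within `tolerance` units of the
    first fragment in a row belong to that row; cells are ordered by `left`.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        items = []
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b>/<i> children.
            content = "".join(el.itertext()).strip()
            if content:
                items.append((float(el.get("top")), float(el.get("left")), content))
        items.sort()  # top-to-bottom, then left-to-right

        current, current_top = [], None
        for top, left, content in items:
            if current_top is not None and top - current_top > tolerance:
                rows.append([c for _, c in sorted(current)])
                current = []
            if not current:
                current_top = top
            current.append((left, content))
        if current:
            rows.append([c for _, c in sorted(current)])
    return rows
```

The font attribute on each text element (not used here) is where the extra formatting information lives, for example to tell header rows from body rows.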

What about ABBYY, does that work well for you?

I'd even pay money to get something that works well.