by sourc3 4010 days ago

I have been working on a side project that needs to read dynamic table layouts and extract financial information. I was excited to hear about Tabula a few weeks ago, but I had no success extracting even a single PDF with it.

I ended up using the pdfquery package in Python, which relies heavily on PDFMiner under the covers.

Besides ABBYY's software (which is proprietary and licensed), does anyone have other recommendations?

4 comments

peterwaller 4010 days ago

Shameless plug: https://pdftables.com

kaitai 4010 days ago

I have tried this and it was very useful as well.

baldfat 4010 days ago

I can't help but say I refuse to work with PDF files. I will email and do a ton of meetings and one-on-ones to explain that PDF is a container and that the format inside the container is the battle. Just give me the plain format; even if it costs the company money, it is worth it.

leejoramo 4010 days ago

Much of the use of these tools is to extract data from government or corporate sources that, while required to publish the information, may not want to make it easy to access. Thus they prefer PDFs.

Those of us trying to extract the data bound up in these PDFs do advocate for access to the original data, but we have to deal with what we have today.

baldfat 4008 days ago

And this is not good for anyone and is the opposite of the spirit behind the Sunshine Laws.

My school district (what a mess) publishes images (horribly bad images) of all the school notes, including all financial information and spreadsheets. One night I had to spend 4 hours manually typing in the year's budget just to check our spending per student. It was $5,400, the lowest in our state.

knowtheory 4010 days ago

It's nice that you can opt out of working with PDFs, but that's not an option for a large portion of the world.

dunham 4010 days ago

I usually use "pdftotext -layout" and write python or perl code to handle the table extraction.
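For anyone who wants to try this approach, here is a minimal sketch of what the Python side might look like. It is not the commenter's actual code. The function names, the two-space column gap, and the sample figures are all made up for illustration, and real tables with ragged or wrapped cells will need more careful handling.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` on pdf_path and return the text it prints."""
    # Passing "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def split_layout_rows(text, min_gap=2):
    """Split each non-blank line into cells on runs of >= min_gap spaces.

    Assumption: columns are separated by at least min_gap spaces, while
    words inside a cell are separated by a single space.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            rows.append(gap.split(stripped))
    return rows


# Made-up sample that mimics layout-preserved output; not real data.
sample = (
    "Item              2013        2014\n"
    "Net revenue      1,200       1,350\n"
    "\n"
    "Operating cost     800         910\n"
)

for row in split_layout_rows(sample):
    print(row)
```

With a real file, the text would come from `pdf_to_layout_text("report.pdf")` (a hypothetical path) instead of the inline sample.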

If I need more detailed formatting information, I use "pdftohtml -xml -fontfullname" and process the resulting XML.
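A rough sketch of processing that XML, assuming the output consists of `<page>` elements containing `<text>` elements with `top` and `left` position attributes. The function name, the 2-unit row tolerance, and the sample document are illustrative assumptions, not the commenter's method. The idea is to group text fragments with nearly equal `top` values into rows, then order each row by `left`.

```python
import xml.etree.ElementTree as ET


def rows_from_pdf2xml(xml_text, tolerance=2):
    """Group <text> elements into rows per page, using their positions.

    Assumption: each <text> element carries numeric "top" and "left"
    attributes. Fragments whose "top" is within `tolerance` of the first
    fragment in a row are treated as belonging to that row.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        items = []
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            content = "".join(node.itertext()).strip()
            if content:
                top = float(node.get("top"))
                left = float(node.get("left"))
                items.append((top, left, content))
        items.sort()
        rows, row_top = [], None
        for top, left, content in items:
            if row_top is None or top - row_top > tolerance:
                rows.append([])
                row_top = top
            rows[-1].append((left, content))
        # Within a row, order the cells left to right.
        pages.append([[text for _, text in sorted(row)] for row in rows])
    return pages


# Made-up sample shaped like the assumed output; not real data.
sample_xml = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0"><b>Item</b></text>
<text top="100" left="300" width="40" height="12" font="0"><b>Amount</b></text>
<text top="121" left="300" width="40" height="12" font="1">1,200</text>
<text top="120" left="50" width="40" height="12" font="1">Net revenue</text>
</page>
</pdf2xml>"""

for page_rows in rows_from_pdf2xml(sample_xml):
    for row in page_rows:
        print(row)
```

The `font` attribute is ignored here; if font information matters (for example, to spot bold header rows), it could be carried along in the tuples in the same way as `left`.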

DenisM 4010 days ago

What about ABBYY, does that work well for you?

I'd even pay money to get something that works well.
