HN Mirror



	by arathore 1927 days ago
	Great project! I've had success using camelot-py (https://camelot-py.readthedocs.io) to extract tabular data from PDFs (for images, I use ImageMagick to convert them to PDF). If your table has borders, the default method (lattice) works quite well. For non-bordered tables there is the 'stream' option, but it usually requires a bit more preprocessing to get usable results.
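
	For anyone who wants to try the lattice/stream choice described above, here is a minimal sketch, assuming camelot-py is installed and the PDF is text-based. `pick_flavor` and `extract_tables` are hypothetical helper names made up for illustration, and "report.pdf" is a placeholder path; only `camelot.read_pdf` with its `pages` and `flavor` arguments and the `.df` attribute come from Camelot itself.

```python
def pick_flavor(has_borders: bool) -> str:
    """Map the rule of thumb onto Camelot's two flavors.

    'lattice' relies on ruled lines between cells; 'stream' guesses
    columns from whitespace between words.
    """
    return "lattice" if has_borders else "stream"


def extract_tables(pdf_path: str, has_borders: bool = True, pages: str = "1"):
    """Return one pandas DataFrame per table Camelot detects (hypothetical helper)."""
    # Imported lazily so the helper above can be used without camelot-py installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=pick_flavor(has_borders))
    return [table.df for table in tables]


# Example usage (placeholder file name, not from the thread):
#   for df in extract_tables("report.pdf", has_borders=False):
#       print(df.head())
```

	With 'stream' the resulting DataFrames often still need cleanup (merged columns, stray header rows), which matches the "more preprocessing" caveat above.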


f430 1927 days ago

How does Camelot extract tables from PDFs? Does it convert the page to an image and then do OCR?


vortex_ape 1927 days ago

Hey! Camelot maintainer here. You can check out this doc for details on how Camelot extracts tables from PDFs: https://camelot-py.readthedocs.io/en/master/user/how-it-work...

As pointed out in this thread, right now it only works with text-based PDFs. But there's a PR[1] which will add OCR support (using EasyOCR) for image-based PDFs at some point.

[1] https://github.com/camelot-dev/camelot/pull/209


mkl 1927 days ago

From the link: "Camelot only works with text-based PDFs and not scanned documents." If you have character data, using it is almost always going to be more accurate than OCR.

I don't know how OP uses it with images converted to PDFs though, as that would be just like a scan, and ImageMagick doesn't do OCR as far as I can tell.


punnerud 1927 days ago

It uses pytesseract and OpenCV, so there is image processing.


mkl 1927 days ago

Looks like it's a bit in-progress: https://github.com/camelot-dev/camelot/pull/209

"Update docs" isn't checked, and that's what I was going on.


vortex_ape 1927 days ago

Yes, I need to work on that PR; I haven't been getting a lot of free time these days. It adds OCR support using EasyOCR, which I found on HN some time ago!
