by mjt58 2761 days ago
Have you tried e.g. https://tabula.technology, https://pdftables.com, https://pypi.org/project/Camelot/?
3 comments

I have a friend who has developed a number of applications that use Tesseract for OCR, specifically on PDFs. The Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/

Would love to learn more about the apps your friend developed. I'm currently doing research into different OCR use cases and tech. Can you shoot me an email at minh@docucharm.com?
https://pdftables.com failed on the test file. It was pretty good, but its interpretation was inconsistent across rows: sometimes it split a cell, sometimes it did not. Tabula failed to detect multi-line rows; after I manually adjusted the table, it did better than pdftables.com at splitting cells. Both failed on the non-printable whitespace characters, which produced garbled output in Excel. The other one would take some time to rig up.
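For the non-printable whitespace problem, one possible workaround is to clean each extracted cell before it goes into the spreadsheet. This is only a rough sketch: I don't know which characters are actually in the test file, so it assumes the usual suspects (non-breaking spaces, zero-width spaces, control characters), and `clean_cell` / `clean_row` are hypothetical helper names, not part of Tabula or pdftables.com.

```python
import unicodedata


def clean_cell(text: str) -> str:
    """Normalize whitespace and drop non-printable characters in one cell."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace() or category == "Zs":
            # Any whitespace (tab, newline, non-breaking space, ...) becomes
            # a plain ASCII space.
            out.append(" ")
        elif category.startswith("C"):
            # Control, format, and unassigned characters (e.g. the zero-width
            # space U+200B or a byte-order mark U+FEFF) are dropped.
            continue
        else:
            out.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(out).split())


def clean_row(row: list[str]) -> list[str]:
    """Apply clean_cell to every cell in an extracted table row."""
    return [clean_cell(cell) for cell in row]
```

Running every row of the extracted table through `clean_row` before writing the CSV or Excel file should at least show whether the garbling comes from those characters or from something else in the PDF.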
You can also try https://docparser.com/.

If nothing works for you and you're comfortable with sharing an example file, you can send it to me and I could take a look.

Rather than the Camelot link you provided, I think you meant Excalibur? https://github.com/camelot-dev/excalibur
Oh yes, thanks :-)