| HN Mirror



	by mjt58 2761 days ago
	Have you tried e.g. https://tabula.technology, https://pdftables.com, https://pypi.org/project/Camelot/?

3 comments

counciltime 2753 days ago

I have a friend who has developed a number of applications that use Tesseract for OCR, specifically on PDFs. The Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/


minhtripham 2742 days ago

Would love to learn more about the apps your friend developed. I'm currently doing research into different OCR use cases and tech. Can you shoot me an email at minh@docucharm.com?


BasHamer 2761 days ago

https://pdftables.com failed on the test file: pretty good, but its interpretation was inconsistent across rows. Sometimes it split a cell, sometimes it did not. Tabula failed to detect multi-line rows; after I manually changed the table, it did do better than pdftables.com on splitting cells. Both failed on the non-printable whitespace characters, which produced garbled output in Excel. The other one would take some time to rig up.
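For the non-printable whitespace problem, one possible workaround is to clean the extracted cells before they reach Excel. This is only a minimal sketch, assuming the tool's output can be loaded as rows of strings (e.g. from a CSV export); `clean_cell` and `clean_rows` are hypothetical helper names, not part of Tabula or pdftables.com.

```python
import unicodedata


def clean_cell(text: str) -> str:
    """Normalize whitespace and drop invisible characters in one table cell."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat == "Zs" or ch in "\t\n\r\x0b\x0c":
            # Unicode space separators (e.g. NBSP) and ASCII whitespace
            # controls all become a plain space.
            out.append(" ")
        elif cat in ("Cc", "Cf"):
            # Remaining control characters and invisible format characters
            # (zero-width space, BOM, soft hyphen, ...) are dropped.
            continue
        else:
            out.append(ch)
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(out).split())


def clean_rows(rows):
    """Apply clean_cell to every cell of a table given as a list of rows."""
    return [[clean_cell(cell) for cell in row] for row in rows]


if __name__ == "__main__":
    raw = [["Total\u00a0 1,234\u200b\t", "a\ufeffb"]]
    print(clean_rows(raw))
```

This does not fix split cells or multi-line rows; it only addresses the garbled characters.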


ocrcustomserver 2761 days ago

You can also try https://docparser.com/.

If nothing works for you and you're comfortable with sharing an example file, you can send it to me and I could take a look.


cdolan 2761 days ago

Rather than the Camelot link you provided, I think you meant Excalibur? https://github.com/camelot-dev/excalibur


mjt58 2760 days ago

Oh yes, thanks :-)
