| HN Mirror


	by meetingthrower 971 days ago
	Awesome - I'm a dabbler, but any thoughts on the best engines for PDF tables? I've got tons of PDFs with similar tables embedded deep in them, but all formatted slightly differently. Seems like it should be easy... but nope!

4 comments

janderson215 971 days ago

Are you able to highlight the text in the PDF? If so, I highly recommend PDF2TXT to extract text from PDFs. It would require some parsing work on your part to convert the output back into a table, but there's zero chance of inference errors since it’s using plain text extraction rather than OCR.

If you can’t highlight the text, it won’t work.
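
For the "parsing work" part, here's a minimal sketch of one way to turn extracted text back into rows. It assumes the extractor preserved the layout so that columns are separated by at least two spaces or a tab, which won't hold for every PDF. The function name `rows_from_text` and the sample text are made up for illustration; this is not part of PDF2TXT.

```python
import re

def rows_from_text(text, min_cols=2):
    """Split layout-preserved text into table rows.

    Assumption: cells are separated by a tab or by 2+ whitespace
    characters; single spaces are treated as part of a cell.
    """
    rows = []
    for line in text.splitlines():
        parts = re.split(r"\s{2,}|\t", line.strip())
        cells = [p.strip() for p in parts if p.strip()]
        # Lines with fewer than min_cols cells are assumed to be
        # headings or body text rather than table rows.
        if len(cells) >= min_cols:
            rows.append(cells)
    return rows

# Hypothetical sample of what extracted text might look like.
sample = """Quarterly report
Item        Qty    Price
Widget A    3      4.50
Widget B    12     1.25
"""

for row in rows_from_text(sample):
    print(row)
```

Since your tables are all formatted slightly differently, you'd probably need to tune the separator pattern and `min_cols` per document family.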

adr1an 971 days ago

You can make any PDF 'highlightable' with GitHub.com/ocrmypdf

lhuser123 971 days ago

It’s not perfect, unfortunately.

junhoyeo 971 days ago

Thanks!

PDF -> Markdown looks like a pretty great use case

Just added box detection support -- maybe I'll start from here https://github.com/junhoyeo/BetterOCR#-box-detection