by sargstuff 748 days ago
Not sure if a multi-step approach is OK, but: convert the PDF to an image format such as PNG, use AI to recognize 'tabular blocks', then convert the PDF to a 'text format' with the tabular blocks embedded as images to preserve spacing.
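
A rough sketch of the last step only, assuming the earlier steps already produced OCR'd text lines with their vertical positions and a list of detected table regions with cropped PNGs. The names `TextLine`, `TableBox`, and `assemble_page` are hypothetical and not from any of the linked tools; the rendering and table detection are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    y: float   # vertical position of the line on the rendered page (pixels)
    text: str  # text recovered for that line (e.g. by OCR)


@dataclass
class TableBox:
    top: float        # top edge of the detected table region (pixels)
    bottom: float     # bottom edge of the detected table region (pixels)
    image_path: str   # hypothetical path to the cropped PNG of the table


def assemble_page(lines, tables):
    """Merge text lines and table crops in top-to-bottom order.

    Lines that fall inside a table region are dropped and the region is
    represented once by an image placeholder, so the table's spacing is
    preserved in the image rather than in the text.
    """
    # Every table contributes exactly one placeholder at its top edge.
    items = [(t.top, f"[table image: {t.image_path}]") for t in tables]
    for line in lines:
        inside_table = any(t.top <= line.y <= t.bottom for t in tables)
        if not inside_table:
            items.append((line.y, line.text))
    items.sort(key=lambda item: item[0])
    return "\n".join(text for _, text in items)


# Made-up example data, for illustration only.
page_lines = [
    TextLine(10, "Quarterly report"),
    TextLine(50, "Item  Q1  Q2"),
    TextLine(70, "Rent  10  12"),
    TextLine(120, "Notes follow."),
]
page_tables = [TableBox(40, 100, "page1_table1.png")]
page_text = assemble_page(page_lines, page_tables)
print(page_text)
```

The placeholder format is arbitrary; a real pipeline might emit a Markdown or HTML image tag instead, depending on the target 'text format'.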

https://stackoverflow.com/questions/3203790/parsing-pdf-file...

https://excalibur-py.readthedocs.io/en/master/

https://ledgerbox.io/blog/extract-tables-with-tesseract-ocr

https://www.johnsnowlabs.com/extract-tabular-data-from-pdf-i...

A bit more in-depth review: https://dev.to/upsilon_it/how-to-extract-tabular-data-from-p...