by sargstuff 748 days ago

Not sure if a multi-step approach is OK, but: convert the PDF to an image format such as PNG, use AI to recognize the 'tabular blocks', then convert the PDF to a 'text format' with the tabular blocks embedded as images to preserve spacing.
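A rough sketch of the last step only (stitching the text back together with tables kept as images), in case it helps. It assumes the earlier steps already happened elsewhere: the page was rasterized, some detector returned vertical extents for each table, and OCR returned text lines with a y position. `Box`, `TextLine`, `assemble_page`, and the `pageN_tableM.png` naming are all made up for illustration, not from any of the tools linked below.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Vertical extent of a detected table region, in page pixels (hypothetical)."""
    top: float
    bottom: float


@dataclass
class TextLine:
    """One OCR'd line of text and its vertical position (hypothetical)."""
    y: float
    text: str


def assemble_page(lines, table_boxes, page_no):
    """Return page text with each table region replaced by an image reference."""
    items = []
    # One placeholder per table, positioned at the table's top edge.
    for i, box in enumerate(sorted(table_boxes, key=lambda b: b.top), start=1):
        items.append((box.top, f"[image: page{page_no}_table{i}.png]"))
    # Keep only text lines that fall outside every table region.
    for line in lines:
        if not any(b.top <= line.y <= b.bottom for b in table_boxes):
            items.append((line.y, line.text))
    # Restore top-to-bottom reading order.
    items.sort(key=lambda item: item[0])
    return "\n".join(text for _, text in items)


if __name__ == "__main__":
    lines = [
        TextLine(10, "Quarterly report"),
        TextLine(50, "Q1  100  200"),
        TextLine(60, "Q2  110  210"),
        TextLine(90, "Totals are unaudited."),
    ]
    print(assemble_page(lines, [Box(45, 65)], page_no=1))
```

The two table rows are dropped from the text and a single image reference takes their place, so the column spacing lives in the cropped image rather than in whitespace. Cropping and saving the image itself is left out here.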

https://stackoverflow.com/questions/3203790/parsing-pdf-file...

https://excalibur-py.readthedocs.io/en/master/

https://ledgerbox.io/blog/extract-tables-with-tesseract-ocr

https://www.johnsnowlabs.com/extract-tabular-data-from-pdf-i...

A bit more in-depth review: https://dev.to/upsilon_it/how-to-extract-tabular-data-from-p...