Yeah, but Textract uses OCR/computer vision even on PDFs with embedded text data, and it can extract tables incredibly well. I don't believe there's an open source equivalent. Maybe some advanced usage of Tesseract?
Are the documents scans, or do they have real text in them? It’s worth trying to convert them to SVG or HTML using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time, you’ll probably find the patterns in there are consistent enough that you can easily grab what you want.
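Here's a rough sketch of that approach in Python, assuming mutool is installed and on your PATH. The helper names (`convert`, `html_text_chunks`, `grab_field`) and the "Invoice No" / "Total" labels are made up for illustration; the markup mutool actually emits will differ per document, so treat the parsing part as a starting point only.

```python
import subprocess
from html.parser import HTMLParser


def mutool_convert_cmd(pdf_path, out_path, fmt="html"):
    # Builds: mutool convert -F html -o out.html input.pdf
    return ["mutool", "convert", "-F", fmt, "-o", out_path, pdf_path]


def convert(pdf_path, out_path, fmt="html"):
    # Runs mutool; raises CalledProcessError if the conversion fails.
    subprocess.run(mutool_convert_cmd(pdf_path, out_path, fmt), check=True)


class TextCollector(HTMLParser):
    """Collects visible text runs, skipping <style> and <script> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_text_chunks(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def grab_field(chunks, label):
    # Hypothetical pattern: the value is either on the same text run as the
    # label ("Total: 99.50") or in the run right after it.
    for i, chunk in enumerate(chunks):
        if chunk.startswith(label):
            rest = chunk[len(label):].strip(" :")
            if rest:
                return rest
            if i + 1 < len(chunks):
                return chunks[i + 1]
    return None
```

Usage would be something like `convert("invoice.pdf", "invoice.html")`, then read the file, pass it to `html_text_chunks`, and call `grab_field(chunks, "Total")`. If the documents are scans with no text layer, this won't find anything and you're back to OCR.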
https://github.com/tabulapdf/tabula