| “PDF is good at what it's supposed to be good.” QFT. PDF should really have been called “Print Description Format”. At heart it’s really just a long list of non-linear drawing instructions for plotting font glyphs; a sort of cut-down PostScript. https://en.wikipedia.org/wiki/PostScript (And, yes, I have done automated text extraction on raw PDF, via Python’s pdfminer. Even with library support, it is super nasty and brittle, and very document specific. Makes DOCX/XLSX parsing seem a walk in the park.) What’s really annoying is that the PDF format is also extensible, which allows additional capabilities such as user-editable forms (XFDF) and Accessibility support. https://www.adobe.com/accessibility/pdf/pdf-accessibility-ov... Accessibility makes text content available as honest-to-goodness actual text, which is precisely what you want when doing text extraction. What’s good for disabled humans is good for machines too; who knew? i.e. PDF format already offers the solution you seek. Yet you could probably count on the fingers of one hand the PDF generators that write Accessible PDF as standard. (As for who’s to blame for that, I leave others to join up the dots.) |