| HN Mirror

PDFs are the opposite of machine-readable if you want to do anything other than render them as images on paper or a screen. They're only slightly more machine-readable than binary executables.

I hate, hate, hate, hate, hate the practice of using PDFs as a system of record. They are intended to be a print format for ensuring consistent typesetting and formatting. For that, I have no quarrel. But so much of the world economy is based on taking text, docx (XML), spreadsheets, or even CSV files, rendering them out as PDFs, and then emailing them around or storing them in databases. They've gone from being simply a view layer to infecting the model layer.

PDFs are a step better than passing around screenshots of text as images - when they don't literally consist of a single image, that is. But even for reasonably-well-behaved, mostly-text PDFs, finding things like "headers" and "sections" in the average case is dependent on a huge pile of heuristics about spacing and font size conventions. None of that semantic structure exists, it's just individual characters with X-Y coordinates. (My favorite thing to do with people starting to work with PDFs is to tell them that the files don't usually contain any whitespace characters, and then watch the horror slowly dawn as they contemplate the implications.) (And yes, I know that PDF/A theoretically exists, but it's not reliably used, and certainly won't exist on any file produced more than a couple years ago.)

Now, with multi-modal LLMs and OCR reaching near-human levels, we can finally... attempt to infer structured data back out from them. So many megawatt-hours wasted in undoing what was just done. Structure to unstructure to structure again. Why, why, why.

As for universality... I mean, sure, they're better than some proprietary format that can only be decrypted or parsed by one old rickety piece of software that has to run in Win95 compatibility mode. But they're not better than JSON or XML if the source of truth is structured, and they're not better than Markdown or - again - XML if the source is mostly text. And there are always warts that aren't fully supported depending on your viewer.