by KMag 1250 days ago

A PDF file is a program for a virtual machine that draws characters. For instance, I believe fonts in PDF work like PostScript fonts, where (for left-to-right languages) each glyph in the font is actually a bytecode function that starts with the brush in the lower-left corner of where the glyph is to be drawn, draws the glyph, and leaves the brush at the lower-left corner of where the next glyph is to be drawn. I think it's somewhat similar to turtle graphics, if you're familiar with Logo programming or G-code if you've ever hand-coded a CNC mill. (PostScript is text instead of bytecode. PDF is an odd mix of a binary and text format, which helps explain why it has had so many parsing security vulnerabilities over the years.)
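To make the "brush" idea concrete, here's a toy sketch in Python. It is not real PostScript or PDF font bytecode: the `GLYPHS` table, the stroke coordinates, and the `draw_text` helper are all made up for illustration. Each glyph draws relative to the current pen position and then advances the pen to where the next glyph should start.

```python
# Toy model: each glyph is a list of line segments relative to its
# lower-left origin, plus an "advance" that moves the pen afterwards.
# The shapes and widths below are invented for illustration only.
GLYPHS = {
    "L": {"strokes": [((0.0, 1.0), (0.0, 0.0)), ((0.0, 0.0), (0.5, 0.0))],
          "advance": 0.75},
    "I": {"strokes": [((0.25, 0.0), (0.25, 1.0))],
          "advance": 0.5},
}


def draw_text(text, origin=(0.0, 0.0)):
    """Return (segments in page coordinates, final pen position)."""
    x, y = origin
    segments = []
    for ch in text:
        glyph = GLYPHS[ch]
        # Draw the glyph relative to the current pen position.
        for (x0, y0), (x1, y1) in glyph["strokes"]:
            segments.append(((x + x0, y + y0), (x + x1, y + y1)))
        # Leave the pen at the lower-left corner of the next glyph.
        x += glyph["advance"]
    return segments, (x, y)


segments, pen = draw_text("LI")
print(pen)  # (1.25, 0.0)
```

Note that nothing in the output says "this was an L followed by an I". You only get strokes on a page, which is why recovering text from a PDF can be harder than it sounds.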

For common cases, it may be possible to basically decompile the PDF, modify the text, re-flow it, and re-compile to bytecode. However, it's very complicated to do in the general case. (Note that in HTML, the browser determines how best to lay out the text, but with PDF, the PDF generator makes the layout decisions.)

Also, many PDF generators will "compress" fonts by lazily building up an embedded font as glyphs are used in the document (usually called font subsetting). These typically will assign "a" to the first glyph used, "b" to the second, etc., so if you decompile "This is some text", you'll see "abcd cd defg hgih". Some PDF generators will helpfully annotate the generated text with "backing text" metadata to help screen readers/copying-to-clipboard, but it's far from universal. So, you might need a database of hashes of all of the bytecode functions in a large number of fonts and/or some image-to-text software in order to reliably decompile the PDF.
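Here's a minimal Python sketch of that first-use numbering, assuming spaces are passed through untouched and there are at most 26 distinct glyphs. The function names `subset_encode` and `decode_with_backing_text` are hypothetical, and the `codes` dict just stands in for the "backing text" metadata (in real PDFs this is typically a ToUnicode mapping attached to the font).

```python
import string


def subset_encode(text):
    """Assign 'a', 'b', ... to glyphs in order of first use.

    Toy assumptions: spaces are kept as-is, and the text uses
    at most 26 distinct non-space characters.
    """
    codes = {}
    out = []
    for ch in text:
        if ch == " ":
            out.append(ch)
            continue
        if ch not in codes:
            codes[ch] = string.ascii_lowercase[len(codes)]
        out.append(codes[ch])
    return "".join(out), codes


def decode_with_backing_text(encoded, codes):
    """Recover the original text, but only if the mapping was kept."""
    reverse = {code: ch for ch, code in codes.items()}
    return "".join(reverse.get(c, c) for c in encoded)


encoded, codes = subset_encode("This is some text")
print(encoded)                                   # abcd cd defg hgih
print(decode_with_backing_text(encoded, codes))  # This is some text
```

Without `codes`, all an extractor sees is "abcd cd defg hgih", and it has to fall back on something like matching glyph outlines or OCR.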

If you're unable to copy text out of a PDF or you get gibberish when you copy text from the PDF, it's likely because the PDF lacks this "backing text" metadata (and in the gibberish case, likely a compressed embedded font). Some scanners will helpfully perform OCR to add this backing text metadata to the generated PDF.

Source: I did a small amount of work related to PDF analysis in Google's web search indexing pipeline over a decade ago. Most of my work was related to figuring out how JavaScript altered web page text, but I did learn just enough about PDF to be dangerous. At the time, Yahoo was Google's biggest competitor, and tons of their indexed PDFs had preview text that was this compressed font "abcd cd de..." garbage. Yahoo evidently decompiled the PDF naively and just trusted that "a" in the embedded font was a bytecode function that drew the glyph "a".