by aidos
1016 days ago
This topic comes up periodically because most people think PDFs are some impenetrable binary format, but they’re really not. They are a graph of objects of different types, and the types themselves are well described in the official spec (I’m a sadist, I read it for fun). My advice is always to convert the PDF to a version without compressed data, as the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You’ll be surprised by how much you can follow. In the article the author missed a step: looking at the page object to see its resources. That’s where the mapping from the font name used in the content stream to the underlying font object is made. There’s another important bit missing: most fonts are subset into the PDF, i.e. only the glyphs that are needed are kept in the font. I think that’s often where the re-encoding happens. ToUnicode is maintained to allow you to copy text from (or search in) a PDF. It’s a nice-to-have for users (in my experience it’s normally there and correct, though).
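To make the rummaging concrete, here is a rough Python sketch of following the page's font resources by hand. It is not from the article: the PDF fragment is a made-up, hand-written example of what decompressed output can look like, the helper names (font_resources, object_body, describe_fonts) are hypothetical, and the regexes only cope with this simple layout where the /Font dictionary is written inline. A real file may put the resources in a separate object or inherit them from a parent, so use a proper parser for anything serious.

```python
import re

# Hand-written, hypothetical fragment of a decompressed PDF (not a real file).
SAMPLE = b"""
3 0 obj
<< /Type /Page /Parent 2 0 R
   /Resources << /Font << /F1 5 0 R /F2 6 0 R >> >>
   /Contents 4 0 R >>
endobj
5 0 obj
<< /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+Helvetica /ToUnicode 7 0 R >>
endobj
6 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Times-Roman >>
endobj
"""


def font_resources(pdf_bytes):
    """Map font resource names (as used in content streams) to object numbers.

    Assumes the /Font dictionary is inline and contains only indirect references.
    """
    fonts = {}
    for match in re.finditer(rb"/Font\s*<<(.*?)>>", pdf_bytes, re.S):
        for name, num in re.findall(rb"/(\w+)\s+(\d+)\s+\d+\s+R", match.group(1)):
            fonts[name.decode()] = int(num)
    return fonts


def object_body(pdf_bytes, obj_num):
    """Return the raw bytes between 'N G obj' and 'endobj', or b'' if not found."""
    pattern = rb"(?<!\d)" + str(obj_num).encode() + rb"\s+\d+\s+obj(.*?)endobj"
    match = re.search(pattern, pdf_bytes, re.S)
    return match.group(1) if match else b""


def describe_fonts(pdf_bytes):
    """For each font resource, report its BaseFont, subset tag and ToUnicode entry."""
    info = {}
    for name, num in font_resources(pdf_bytes).items():
        body = object_body(pdf_bytes, num)
        match = re.search(rb"/BaseFont\s*/([^\s/<>\[\]()]+)", body)
        base = match.group(1).decode() if match else None
        info[name] = {
            "object": num,
            "base_font": base,
            # Subset fonts carry a tag of six uppercase letters followed by '+'.
            "subset": bool(base and re.match(r"[A-Z]{6}\+", base)),
            "to_unicode": b"/ToUnicode" in body,
        }
    return info


if __name__ == "__main__":
    for name, details in describe_fonts(SAMPLE).items():
        print(name, details)
```

To try it on a real file, read the output of mutool clean -d in binary mode and pass the bytes to describe_fonts, keeping in mind the layout assumptions above.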
|
I think the word for that is masochist. Now, if you had helped write the spec or were making others read it...