by xyzxyz998 3275 days ago
That was informative, thank you.

I think a lot of people in the dev/power user community wouldn't mind paying $1 for a Kindle ebook where you note all your findings.

There have been so many instances where I wanted to do stuff with pdfs but ended up deflated.

> subset fonts

So you mean if a font has been embedded with three glyphs, 0x41=A, 0x61=a, 0x62=b, then string Aba would be \1\3\2?


That's correct. As a sibling has said, there are other ways to do it, but most of the PDFs I need to work with are done by simply remapping in order of occurrence (e.g., if an X is the first char in the doc, it's referenced as \1). You can tell subset fonts because they're named RANDPREFIX+fontname, so different subset fonts from the same base font won't collide.
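
To make the two ideas above concrete, here's a minimal Python sketch. It isn't taken from any PDF library: `remap_in_order` and `split_subset_name` are made-up names, and the sketch assumes the prefix is the usual tag of six uppercase letters followed by `+`. Real PDFs may assign codes differently, as noted above.

```python
import re

def remap_in_order(text):
    """Assign codes 1, 2, 3, ... to characters in order of first occurrence."""
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes, [codes[ch] for ch in text]

# Assumption: a subset tag is six uppercase letters followed by '+'.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_name(font_name):
    """Return (tag, base_name); tag is None when the name has no subset prefix."""
    m = SUBSET_TAG.match(font_name)
    if m is None:
        return None, font_name
    return m.group(1), m.group(2)

codes, encoded = remap_in_order("Aba")
print(encoded)                                # [1, 2, 3]
print(split_subset_name("ABCDEF+Helvetica"))  # ('ABCDEF', 'Helvetica')
```

Note that with strict order of occurrence, "Aba" comes out as \1\2\3 (A, then b, then a), which differs from a layout where codes follow the sorted glyph list.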

You can get a good overview of the state of the fonts in your PDF using:

    pdffonts file.pdf

There's a column which tells you if there's a Unicode map available for the font. That's important. Because PDF is just rendering glyphs at positions, it doesn't even know what the character names are. To allow you to copy and paste, most fonts in most PDFs will have a Unicode map from the glyph ID to the Unicode symbol.
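
As a rough illustration of what such a Unicode map looks like, here's a Python sketch that reads the `beginbfchar ... endbfchar` sections of a ToUnicode CMap and uses them to decode glyph codes. The sample CMap text and the names `parse_bfchar` and `decode` are invented for the example. It ignores `bfrange` sections and everything else a real CMap can contain, so treat it as a sketch, not a parser.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Map glyph code -> Unicode string, using bfchar entries only."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode(codes, mapping):
    """Turn glyph codes into text; unknown codes become U+FFFD."""
    return "".join(mapping.get(c, "\ufffd") for c in codes)

# Hypothetical CMap fragment for a three-glyph subset font.
SAMPLE = """
3 beginbfchar
<01> <0041>
<02> <0062>
<03> <0061>
endbfchar
"""

print(decode([1, 2, 3], parse_bfchar(SAMPLE)))  # Aba
```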

If that's not available, in some cases you can rebuild it yourself by looking at the character encodings and substitutions.

On the book, do you have any examples? I'll probably never get around to writing anything down, but if it looks easy enough it's probably worth having a stab at.

Also, large caveat, I'm not a PDF or font expert. I've probably mangled the terminology here but hopefully it gives you a rough idea.

> a Kindle ebook

I think you mean "a handcrafted pdf"?

The PDF reference is freely available and pretty readable too. I would recommend just reading that.

To answer your question, subsetting a font just means taking a portion of its glyphs, and it doesn't imply remapping. In fact, for most sane PDF files you will find ASCII characters mapped to themselves, making text search within a decompressed PDF possible. My dirty watermark remover script basically uses qpdf to decompress the thing and then uses regular expressions to search for Tj or TJ right after the specified string.
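
A rough Python sketch of that approach, not the actual script: it assumes the `qpdf` command-line tool is installed and that the text you're after is stored as plain ASCII literal strings. The function names and regexes are made up for illustration, and the regexes don't handle nested parentheses or hex strings.

```python
import re
import subprocess

def decompress(src, dst):
    """Write an uncompressed copy of src using qpdf's QDF mode."""
    subprocess.run(
        ["qpdf", "--qdf", "--object-streams=disable", src, dst],
        check=True,
    )

# A literal string: "(...)", allowing backslash escapes inside.
STRING = re.compile(rb"\(((?:\\.|[^\\()])*)\)")
# "(text) Tj" shows one string; "[(te) -20 (xt)] TJ" shows kerned pieces.
SHOW_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")
SHOW_TJ_ARRAY = re.compile(rb"\[((?:\\.|[^\\\]])*)\]\s*TJ")

def find_shown_text(data, needle):
    """Return the raw Tj/TJ operations whose text contains needle."""
    hits = [m.group(0) for m in SHOW_TJ.finditer(data) if needle in m.group(1)]
    for m in SHOW_TJ_ARRAY.finditer(data):
        # Join the string pieces, dropping the kerning numbers between them.
        if needle in b"".join(STRING.findall(m.group(1))):
            hits.append(m.group(0))
    return hits
```

Usage would look something like `decompress("in.pdf", "plain.pdf")`, then `find_shown_text(open("plain.pdf", "rb").read(), b"CONFIDENTIAL")`. Removing the matched operations changes stream lengths, so the edited file would need repairing before it is valid again.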

This is a copy of the ISO 32000 PDF specification:

http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/p...

This is a long document, but it is very well written. If you read it on the bus or while you're waiting for your compiler to finish, you will get to understand it.

Thanks for sharing!

Adobe used to publish and distribute the PDF spec on their developers site. Used to be able to read it and hand code PDFs. Not sure if such a resource is still available.

Wish I still had a copy but it was a while back.

The spec is still available freely: http://www.adobe.com/devnet/pdf/pdf_reference.html

For historical interest, older versions (going back to 1.3) are here: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

1.2, 1.1, and 1.0 can be found elsewhere on the Internet.

Sweet, thanks. Haven't been on the Adobe devnet in a while.