| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ks2048 1023 days ago

This seems to be missing an important point: at the end of PDF is a table ("cross-reference" table) that stores the BYTE-OFFSET to different objects in the file.

If you modify things within the file, typically these offsets will change and the file will be corrupt. It looks like in this article, maybe they were only interested in changing one number to another, so none of the positions change.

But, generally, adding/removing/modifying things in the middle of the file require recomputing the xref table and thus become much easier to use a library rather than direct text editing.

4 comments

gpvos 1023 days ago

That's why they decode it with qpdf and re-encode it again afterwards, so qpdf takes care of that. qpdf reconstructs the original PDF structure, and I think it even tries to keep the object numbers the same, but the offsets are recalculated completely.

link

userbinator 1023 days ago

That's the weirdest part of the PDF spec IMHO. It's a mix of both binary and text, with text-specified byte offsets. It would be very interesting to read about why the format became like that, if its authors would ever talk about it. My guess is that it was meant to be completely textual at first (but then requiring the xref table to have fixed-length entries is odd), and then they decided binary would be more efficient.

link

detourdog 1023 days ago

I actually was at a Acrobat/PDF launch event in midtown NYC. It was an embedded file type that could be generated at the type of publishing and all dependencies could either be embedded or not.

This made a coherent point in a digital workflow that could be saved and reprinted with ease. This was a big deal before the portable document format came to be.

I once made a workflow that took pdf files from Word, filemaker, excel, and mini-cad. This all got combined into a single 9,000 page pdf. The final pdf had a coherent thumbnails, page numbers and headers and footer.

Only took a couple of hours to get the final documnet after pushing the go buttton.

link

Someone 1023 days ago

> My guess is that it was meant to be completely textual at first

It indeed started life as “not Turing complete postscript with an index” (those makes it easy to render just the third page of a PDF file, something that’s impossible in postscript without rendering the first and second pages first). Like postscript, that was a pure text format.

One nice feature is that you can append a few pieces and a new index to an existing PDF file and get a new valid PDF file (which would still contain its old index as a piece of “junk DNA”)

I think compression was added because users complained about file sizes. Ascii85 (https://en.m.wikipedia.org/wiki/Ascii85) grows binary data by 25%.

> but then requiring the xref table to have fixed-length entries is odd

My guess is that made it easier to hack together a tool to convert PDF to postscript.

link

pmarreck 1023 days ago

The roots of PDF are PostScript, which is like Forth, and is text-based, so that’s why

link

aidos 1023 days ago

In my experience it’s easiest just to break the xref table and run something like “mutool clean” to fix it again. It can be completely derived from the content so it’s safe to do.

link

bena 1023 days ago

Ah. So it's a lot like editing compiled binaries.

You can modify binaries all you want as long as you preserve the length of everything.

Some piece of software we had authenticated against a server, but everything was done on the client. The client executed SQL against the server directly, etc. Basically, the server checked to see if this client would put you over the number of licenses you purchased and that's it.

I had run it against a disassembler, found the part where it performed the check, and was able to change it to a straight JMP and then pad the rest of the space with NOPs.

link