by aidos 159 days ago

You can replace objects in PDF documents. A PDF is mostly just a collection of typed objects; the type tells a reader what to do with each one. Each object has a numbered ID. I recommend mutool for decompressing the PDF so you can read it in a text editor:

    mutool clean -d in.pdf out.pdf

If you look below you can see a Pages list (1 0 obj) that references (2 0 R) a Page (2 0 obj).

    1 0 obj
    <<
      /Type /Pages
      /Count 1
      /Kids [ 2 0 R ]
    >>
    endobj

    2 0 obj
    <<
      /Type /Page
      /Contents 5 0 R
      ...
    >>
    endobj
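
To make the "objects pointing at objects" idea concrete, here is a minimal Python sketch that indexes the objects in a decompressed file and lists what each one references. The names `index_objects` and `references` are made up for illustration, and the regexes assume readable text (for example after the mutool step above) with no stream data that happens to contain the word "endobj". It is nowhere near a real PDF parser.

```python
import re

# Matches "N G obj ... endobj". Assumes the file was decompressed first
# so that object bodies are readable text.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
# Matches indirect references such as "2 0 R".
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map (object number, generation) -> raw object body."""
    return {(int(n), int(g)): body for n, g, body in OBJ_RE.findall(data)}


def references(body: bytes) -> list:
    """Return the (number, generation) pairs that a body points at."""
    return [(int(n), int(g)) for n, g in REF_RE.findall(body)]
```

Calling `references` on the body of object 1 from the example above would give back the pair for the Page it lists in /Kids.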

Rather than editing the PDF in place, you can supersede an object by appending a new copy of it to the end of the file, together with a new cross-reference section that points at the new copy. The object keeps its number and generation (1 0 here); the generation is only bumped when an object number is freed and later reused. This leaves the original bytes intact while making edits.

    1 0 obj
    <<
      /Type /Pages
      /Count 2
      /Kids [ 2 0 R 200 0 R ]
    >>
    endobj
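
A rough Python sketch of what appending such an update could look like, under some loud assumptions: the file uses a classic cross-reference table rather than a cross-reference stream, and `append_update`, `root` and `size` are names invented for this illustration. It appends one object, then a one-entry xref section and a trailer whose /Prev points back at the previous xref section, so the original bytes are never touched.

```python
import re


def append_update(pdf: bytes, num: int, gen: int, body: bytes,
                  root: str, size: int) -> bytes:
    """Append one replacement object plus a new xref section and trailer.

    Assumption: classic (non-stream) xref table. `root` is the catalog
    reference (e.g. "3 0 R") and `size` is the highest object number
    plus one; the caller copies or derives both from the old trailer.
    """
    # Offset of the previous xref section, read from the last "startxref".
    # Raises IndexError if the file has none.
    prev = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])

    out = pdf if pdf.endswith(b"\n") else pdf + b"\n"
    offset = len(out)  # byte offset where the new object starts
    out += b"%d %d obj\n" % (num, gen) + body + b"\nendobj\n"

    xref_at = len(out)
    out += b"xref\n%d 1\n" % num
    # Each xref entry is exactly 20 bytes: offset, generation, "n", EOL.
    out += b"%010d %05d n \n" % (offset, gen)
    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (
        size, root.encode("ascii"), prev)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return out
```

For the Pages example above you would also have to append the new Page (object 200) and give it its own xref entry; this sketch only handles a single object.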

You can put pretty much anything you want inside a PDF, and it can be orphaned so that a PDF reader never picks up on it. Nothing says an object needs to be referenced. (Oh, and there's a "trailer" at the end of the PDF that says where the Root node is, so readers know where to start.)
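
One way to see the orphan idea in practice is to walk the references starting from the trailer and report every object that is never reached. This is only a sketch in Python: `find_orphans` is a made-up name, and it assumes a decompressed file with a classic `trailer` keyword and a single definition per object (no appended updates).

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_orphans(data: bytes) -> set:
    """Return (number, generation) keys not reachable from the trailer."""
    objs = {(int(n), int(g)): body for n, g, body in OBJ_RE.findall(data)}

    start = data.rfind(b"trailer")
    if start < 0:
        raise ValueError("no classic trailer found")

    # Seed the walk with everything the trailer references (e.g. /Root).
    todo = [(int(n), int(g)) for n, g in REF_RE.findall(data[start:])]
    seen = set()
    while todo:
        key = todo.pop()
        if key in seen or key not in objs:
            continue
        seen.add(key)
        # Follow every reference inside the object we just reached.
        todo.extend((int(n), int(g)) for n, g in REF_RE.findall(objs[key]))
    return set(objs) - seen
```

Anything it returns is present in the file but invisible to a reader that starts from the Root.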

pfisherman 159 days ago

Thanks for the technical explanation! This is pretty fascinating.

So it works kind of like a soft delete: dereference instead of scrubbing the bits.

Is this behavior generally explicitly defined in PDF editors (i.e. an intended feature)? Is it defined in some standard or set of best practices? Or is it a hack (or half-baked feature) someone implemented years ago that has just kind of stuck around and propagated?

clord 159 days ago

The intention is to make editing easy and quick on slow, memory-constrained computers. This is why, for example, filling in form field values in a PDF can be so fast: it’s just appending new values for those nodes. If you need to strip out the edit history, you’d have to regenerate a fresh PDF from the root.

SeriousM 159 days ago

To put it reaaaaaly simply, a PDF is like a Notion document (blocks and bricks) with a git-like object graph?

aidos 159 days ago

Ha! As if anything about Notion is simple.

But yeah. It's all just objects pointing at each other. It's mostly tree-structured, but not entirely. You have a Catalog of Pages that have Resources, like Fonts (which are likely to be shared by multiple pages, hence not a tree). Each Page has Contents that are a stream of drawing instructions.

This gives you a sense of what it all looks like. The contents of a page are a stack-based vector drawing program. Squint a little (or stick it through an LLM) and you'll see that Tf switches to font F4 from the resources at size 14.66, Td moves the text position, Tj draws a character there, etc.

    2 0 obj
    <<
      /Type /Page
      /Resources <<
        /Font <<
          /F4 4 0 R
        >>
      >>
      /Contents 5 0 R
    >>
    endobj

    5 0 obj
    <<
      /Length 340
    >>
    stream
    q
    BT
    /F4 14.66 Tf
    1 0 0 -1 0 .47981739 Tm
    0 -13.2773438 Td <002B> Tj
    10.5842743 0 Td <004C> Tj
    ET
    Q...
    endstream
    endobj
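
If you'd rather not squint, a tiny Python sketch can split a stream like this into operators and their operands. Operands come first and pile up; each operator then consumes whatever is waiting. `operations` is a made-up name, and the tokenizer only knows numbers, names like /F4 and hex strings like <002B>; literal strings, arrays, dictionaries and inline images are deliberately ignored, so treat it as an illustration rather than a content-stream parser.

```python
import re

# Tokens: hex strings <...>, names /F4, numbers, and bare operators.
TOKEN_RE = re.compile(
    r"<[0-9A-Fa-f\s]*>|/[^\s/<>\[\]()]+|[-+]?\d*\.?\d+|[A-Za-z*']+"
)


def operations(content: str):
    """Yield (operator, operands) pairs from a simple content stream."""
    stack = []
    for tok in TOKEN_RE.findall(content):
        if tok[0] in "</+-.0123456789":
            stack.append(tok)   # operand: wait for the next operator
        else:
            yield tok, stack    # operator: takes everything waiting
            stack = []
```

Fed the stream above, it pairs Tf with the font name and size, Tm with its six matrix numbers, and each Tj with one hex glyph code. What those codes mean as characters depends on the encoding of font F4.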

I'm going to hand-wave away the 100+ different types of objects. But at its core it's a simple model.
