
by aidos 657 days ago

Mutool is the one I suggest to people. The easiest way to understand a PDF is to decompress it and then just read the contents.

    mutool clean -d in.pdf out.pdf

At that point you’ll realise that a PDF is mostly just a list of objects, and that those objects can reference each other. After that you’ll journey through the spec, learning what each type of object does and what its fields control. The graphics stream itself is just a stack-based coordinate drawing system that’s easy to follow too.

By way of example, here's an object that represents a Page. You can see the dimensions in the MediaBox. The contents themselves live in object "9 0 obj" ("9 0 R" is the pointer to it):

    2 0 obj
    <<
      /Type /Page
      /MediaBox [ 0 0 612 792 ]
      /Contents 9 0 R
    >>
    endobj
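To make the "list of objects that reference each other" idea concrete, here is a rough Python sketch that pulls the references and the MediaBox out of the decompressed text above. The regexes and the `summarise_objects` name are my own illustration, not part of mutool, and they only cope with simple decompressed text like this; a real parser has to deal with strings, streams, the xref table and much more.

```python
import re

# The Page object from the post, as it appears in a decompressed PDF.
PAGE_OBJ = """2 0 obj
<<
  /Type /Page
  /MediaBox [ 0 0 612 792 ]
  /Contents 9 0 R
>>
endobj
"""

# "N G obj ... endobj" defines an object; "N G R" is a reference to one.
# These patterns are a simplification for illustration only.
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
REF_RE = re.compile(r"(\d+)\s+(\d+)\s+R\b")
BOX_RE = re.compile(r"/MediaBox\s*\[\s*([^\]]*?)\s*\]")


def summarise_objects(text):
    """Map (object number, generation) to the references and MediaBox in its body."""
    summary = {}
    for num, gen, body in OBJ_RE.findall(text):
        refs = [(int(n), int(g)) for n, g in REF_RE.findall(body)]
        box = BOX_RE.search(body)
        media_box = [float(v) for v in box.group(1).split()] if box else None
        summary[(int(num), int(gen))] = {"refs": refs, "media_box": media_box}
    return summary


if __name__ == "__main__":
    for key, info in summarise_objects(PAGE_OBJ).items():
        print(key, info)
```

Running it on the Page object reports that object (2, 0) points at object (9, 0) and has a 612 x 792 point MediaBox, i.e. US Letter at 72 points per inch.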

Meanwhile "9 0 obj" has the drawing instructions. They seem a little weird at first glance, but you can see the six values ".23999999 0 0 -.23999999 0 792" each get pushed on the stack, and then "cm" pops them and applies them as a transformation matrix.

    9 0 obj
    <<
      /Length 18266
    >>
    stream
    .23999999 0 0 -.23999999 0 792 cm
    q
    0 0 2551 3301 re
    ...
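If it helps, the push-operands-then-pop-on-operator model can be sketched in a few lines of Python. This is a toy interpreter of my own for just `cm`, `q`, `Q` and `re`, not how mutool or any real renderer is written; the function names are made up, and the whitespace tokenising would break on strings, names, arrays and inline images.

```python
def concat(m, n):
    """Return the matrix product m x n for PDF-style matrices (a, b, c, d, e, f)."""
    ma, mb, mc, md, me, mf = m
    na, nb, nc, nd, ne, nf = n
    return (
        ma * na + mb * nc, ma * nb + mb * nd,
        mc * na + md * nc, mc * nb + md * nd,
        me * na + mf * nc + ne, me * nb + mf * nd + nf,
    )


def apply(m, point):
    """Transform a point: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def run_content_stream(source):
    """Interpret a tiny subset of content-stream operators (cm, q, Q, re)."""
    stack = []                             # operands waiting for an operator
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)   # current transformation matrix (identity)
    saved = []                             # matrices saved by "q"
    rects = []                             # two opposite corners per rectangle, in page space

    for token in source.split():
        try:
            stack.append(float(token))     # numbers are operands: push them
            continue
        except ValueError:
            pass                           # not a number, so treat it as an operator
        if token == "cm":
            m = tuple(stack[-6:])
            del stack[-6:]
            ctm = concat(m, ctm)           # new matrix is applied before the old one
        elif token == "q":
            saved.append(ctm)              # save graphics state (only the matrix here)
        elif token == "Q":
            ctm = saved.pop()              # restore graphics state
        elif token == "re":
            x, y, w, h = stack[-4:]
            del stack[-4:]
            rects.append([apply(ctm, p) for p in ((x, y), (x + w, y + h))])
        else:
            stack.clear()                  # unknown operator: discard its operands
    return ctm, rects


STREAM = """.23999999 0 0 -.23999999 0 792 cm
q
0 0 2551 3301 re
"""

if __name__ == "__main__":
    ctm, rects = run_content_stream(STREAM)
    print(ctm)
    print(rects)
```

With the stream above, the rectangle's corner (0, 0) lands at (0, 792) and the opposite corner at roughly (612.24, -0.24), so the 2551 x 3301 rectangle covers the 612 x 792 page. The negative scale flips the y axis so the origin sits at the top-left, and .24 is about 72/300, which would be consistent with content laid out at 300 units per inch (my reading, not something the file states).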

The depth and detail of all of the different possible things that can be represented in a PDF is insane. But understanding the structure above is all you need to begin your journey!

EDIT: The rest of your journey is contained in this epic document: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...


nicolodev 657 days ago

> mutool clean -d in.pdf out.pdf

My tool can do exactly the same (viewing the internal structure, exporting objects, and seeing the uncompressed raw content of streams) with a graphical interface and without all these flags, which is one of the reasons I started designing this project with egui. But thanks for posting yours too.
