| HN Mirror


by svat 700 days ago

This is cool!

Here are some other similar(?) tools, for seeing the inner contents of a PDF file (the raw objects etc), but I haven't compared them to this tool here:

- https://pdf.hyzyla.dev/

- https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-rups-7.2.5.jar)

- https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax inspect foo.pdf > output.html)

- https://github.com/trailofbits/polyfile (polyfile --html output.html foo.pdf)

- https://www.reportmill.com/snaptea/PDFViewer/ = https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag PDF onto it)

- https://sourceforge.net/projects/pdfinspector/ (an "example" of https://superficial.sourceforge.net/)

- https://www.o2sol.com/pdfxplorer/overview.htm

More?

6 comments

aidos 700 days ago

Mutool is the one I suggest to people. The easiest way to understand a PDF is to decompress it and then just read the contents.

    mutool clean -d in.pdf out.pdf
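
For a feel of what "decompress" means here, this is a minimal Python sketch, not what mutool actually does internally. It assumes the stream uses the common /FlateDecode filter (plain zlib) and that /Length is a direct number rather than a reference; `inflate_first_stream` and the synthetic object are made up for illustration:

```python
import re
import zlib


def inflate_first_stream(pdf_bytes: bytes) -> bytes:
    """Inflate the first stream found, assuming /FlateDecode and a direct /Length."""
    header = re.search(rb"/Length (\d+).*?stream\r?\n", pdf_bytes, re.DOTALL)
    if header is None:
        raise ValueError("no stream with a direct /Length found")
    start = header.end()                # first byte of the stream data
    length = int(header.group(1))       # number of compressed bytes
    return zlib.decompress(pdf_bytes[start:start + length])


# Build a tiny synthetic object so the sketch runs without a real PDF.
content = b"0 0 612 792 re\nf\n"
compressed = zlib.compress(content)
obj = (
    (b"9 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed))
    + compressed
    + b"\nendstream\nendobj\n"
)

print(inflate_first_stream(obj).decode("ascii"))
```

Real files use other filters and indirect lengths too, which is why a proper tool is the better choice for anything beyond poking around.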

At that point you’ll realise that a PDF is mostly just a list of objects and that those objects can reference each other. After that you’ll journey through the spec, understanding what each type of object does and what its fields control. The graphics stream itself is just a stack-based coordinate drawing system that’s easy to follow too.

By way of an example, here's an object that represents a Page. You can see the dimensions in the MediaBox. The contents themselves live in object "9 0 obj" ("9 0 R" is the pointer to it):

    2 0 obj
    <<
      /Type /Page
      /MediaBox [ 0 0 612 792 ]
      /Contents 9 0 R
    >>
    endobj
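
The "list of objects that reference each other" idea can be sketched in a few lines of Python. This is a rough illustration for already-decompressed, simple files only: it ignores the xref table and would break on binary streams that happen to contain "endobj". The names `index_objects` and `references` are invented for this sketch, and the sample is a shortened version of the objects in this comment:

```python
import re

# "N G obj ... endobj" blocks and "N G R" pointers, as seen in a decompressed file.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+) (\d+) R\b")


def index_objects(pdf_bytes):
    """Map (object number, generation) to the raw body of that object."""
    return {(int(n), int(g)): body.strip() for n, g, body in OBJ_RE.findall(pdf_bytes)}


def references(body):
    """List every (object number, generation) that a body points to."""
    return [(int(n), int(g)) for n, g in REF_RE.findall(body)]


sample = b"""2 0 obj
<<
  /Type /Page
  /MediaBox [ 0 0 612 792 ]
  /Contents 9 0 R
>>
endobj
9 0 obj
<<
  /Length 16
>>
stream
0 0 2551 3301 re
endstream
endobj
"""

objects = index_objects(sample)
for target in references(objects[(2, 0)]):
    # Follow the pointer from the Page to its content stream.
    print(target, objects[target].decode("ascii"))
```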

Meanwhile "9 0 obj" has the drawing instructions. They seem a little weird at first glance, but you can see the values ".23999999 0 0 -.23999999 0 792" each get pushed onto the stack, and then "cm" pops them and interprets them as the transformation matrix.

    9 0 obj
    <<
      /Length 18266
    >>
    stream
    .23999999 0 0 -.23999999 0 792 cm
    q
    0 0 2551 3301 re
    ...
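
To make the stack idea concrete, here is a toy Python interpreter for just the "cm" and "re" operators. It is a sketch under heavy assumptions: whitespace-separated tokens, numbers only as operands, no q/Q state saving, and every other operator simply discards its operands. The function names are invented for illustration:

```python
def concat(m, n):
    """Multiply two affine matrices given as (a, b, c, d, e, f): result is m x n."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def apply(m, x, y):
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def run(content):
    """Return the transformed corner pairs of every rectangle drawn with 're'."""
    stack = []
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # identity matrix to start with
    rects = []
    for token in content.split():
        try:
            stack.append(float(token))      # operands are pushed on the stack
            continue
        except ValueError:
            pass                            # not a number, so it is an operator
        if token == "cm":
            m = tuple(stack[-6:])
            del stack[-6:]
            ctm = concat(m, ctm)            # new matrix is applied before the old one
        elif token == "re":
            x, y, w, h = stack[-4:]
            del stack[-4:]
            rects.append((apply(ctm, x, y), apply(ctm, x + w, y + h)))
        else:
            stack.clear()                   # unhandled operator: drop its operands
    return rects


rects = run(".23999999 0 0 -.23999999 0 792 cm q 0 0 2551 3301 re")
print(rects)
```

The rectangle's corners come out at about (0, 792) and (612.24, -0.24), so the matrix scales the 2551 x 3301 coordinate space down to roughly the 612 x 792 MediaBox and flips the y axis.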

The depth and detail of all of the different possible things that can be represented in a PDF is insane. But understanding the structure above is all you need to begin your journey!

EDIT The rest of your journey is contained in this epic document: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...


nicolodev 700 days ago

> mutool clean -d in.pdf out.pdf

My tool can do exactly the same (viewing the internal structure, exporting objects, and seeing the uncompressed raw content of streams) with a graphical interface and without all these flags (which is one of the reasons I started designing this project with egui), but thanks for posting yours too.


desgeeko 700 days ago

I am the author of PDFSyntax, thanks for mentioning it!

The HTML output is like a pretty-print where you can view objects and follow links to other objects.

Since then I have added a new command (disasm) that is CLI-oriented and displays a greppable summary of the structure. Here is an explanation: https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...


mistrial9 687 days ago

python3 -m pdfsyntax: error: argument command: invalid choice: 'inspect' (choose from 'browse', 'disasm', 'overview', 'text')


nicolodev 700 days ago

Thanks for the list. The idea behind my tool was to build something that suits an analyst who wants to take a quick look at a PDF. I'm also trying to figure out some fast heuristics to mark/highlight peculiar stuff in the file itself.

As for the tools you mentioned, I haven't checked out all of them, but some are interesting (and more mature in terms of testing and compatibility). However, some (at least the ones I tried) are very basic, and they don't offer "Save object as.." or let you uncompress an object. I like the feature of displaying the PDF for preview :)


mananaysiempre 700 days ago

The venerable PDFedit[1] more or less forces you to confront the internal structure of the PDF file as well.

[1] http://pdfedit.cz/en/index.html


richardw 700 days ago

Recommend just letting people have their one day in the sun. We’ve become less a site of builders than a red team for testing your launch.


nicolodev 700 days ago

Yeah, I agree. Everyone is suggesting tools which are really good, but I designed mine to get rid of the flags and the CLI. That's fine for tech people who can remember flags; I'm not one of them :(


whizzter 700 days ago

Sweet, currently working on PDF signature stuff so I'm sure I'll find some stuff handy :)
