Hacker News
by svat 989 days ago
This is great, thanks for sharing!

It is also inspiring to me, because I have had the same idea and been working on something like this on-and-off since early 2022, but the contrast between your project and the state of mine[1] is like a textbook example of how to ship vs how not to ship:

• Instead of using an existing parser like pdf.js as you do, I started writing my own parser from scratch, in the process learning Rust and its Nom library for parsing, its integration with Webassembly, etc.

• I wrote not just a straightforward parser, but a crazy one that preserves all the details like whitespace etc. (what a typical parser is supposed to ignore), so that I can test whether it round-trips successfully.

• After I got it working, I didn't stop at "works on almost all PDFs in practice" (the same as with pdf.js or any other PDF implementation) but actually chased down and investigated every single failure, checking whether each file works in any other PDF application/library (Preview, Chrome, qpdf, Adobe Reader, etc.), until I could prove to my satisfaction that it's not a fault in the parser. (This is still not complete…)

• When I returned to this project again after several months, instead of making further progress I spent time starting to document the code, making minor improvements and tweaks, etc.
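To make the round-trip idea from the second bullet concrete, here is a minimal sketch in Python (not the Rust/Nom code from the project). The token kinds and the names `TOKEN_RE`, `tokenize` and `serialize` are made up for illustration, and the lexer is deliberately crude: it does not treat string literals or streams specially. The point is only that every byte, including whitespace and comments, ends up in some token, so concatenating the tokens must reproduce the input exactly.

```python
import re

# PDF whitespace bytes: NUL, HT, LF, FF, CR, SP.
# PDF delimiter bytes: ( ) < > [ ] { } / %
# Hypothetical token kinds; a real PDF lexer distinguishes many more.
TOKEN_RE = re.compile(
    rb"(?P<ws>[\x00\t\n\x0c\r ]+)"
    rb"|(?P<comment>%[^\r\n]*)"
    rb"|(?P<delim><<|>>|[\[\]{}()<>/])"
    rb"|(?P<regular>[^\x00\t\n\x0c\r \[\]{}()<>/%]+)"
)


def tokenize(data):
    """Split bytes into (kind, text) pairs without discarding anything."""
    tokens = []
    pos = 0
    while pos < len(data):
        m = TOKEN_RE.match(data, pos)
        if m is None:  # should not happen: the four alternatives cover every byte
            raise ValueError("unlexable byte at offset %d" % pos)
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def serialize(tokens):
    """Write the tokens back out; lossless because whitespace was kept."""
    return b"".join(text for _, text in tokens)


sample = b"<< /Type /Page  % a comment\r\n/MediaBox [0 0 612 792] >>\n"
# The round-trip check: parse, write back, compare byte for byte.
assert serialize(tokenize(sample)) == sample
```

A parser that drops whitespace cannot be checked this way, which is the trade-off being described: the lossless version is more work, but any mismatch points at a bug (or at a genuinely odd file).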

So the end result is that my project basically does nothing still, while you have a working PDF debugger. :) This is the difference between a project that intends to actually produce something and one that ends up being mostly for learning/fun with the goal mostly forgotten… not that I have any regrets :)

[Meta: Something similar is true of this comment too, which I started two days ago but left as a draft… until I finally had a burst of energy and posted just now.]

Returning to your project, a couple of feature requests:

- Provide a shortcut to jump directly to the node for page N, for any user-provided page number N.

- (Where possible) Some annotation of the page content stream operators — the Tj, Td, etc.
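For the second request, a rough sketch of what such annotation could look like, in Python for brevity. The operator meanings are paraphrased from the PDF specification's operator tables; the names `OPERATORS` and `annotate` are hypothetical, only a handful of operators are listed, and the whitespace-based splitting is an assumption that breaks on strings or arrays containing spaces.

```python
# A few content stream operators with short descriptions.
# Deliberately incomplete; unknown operators are treated as operands here.
OPERATORS = {
    "BT": "begin text object",
    "ET": "end text object",
    "Tf": "set text font and size",
    "Td": "move to start of next line, offset by (tx, ty)",
    "Tm": "set text matrix",
    "Tj": "show a text string",
    "TJ": "show text strings with individual glyph positioning",
    "q": "save graphics state",
    "Q": "restore graphics state",
    "cm": "concatenate matrix to current transformation matrix",
}


def annotate(stream):
    """Group each operator with the operands that precede it.

    Content streams are postfix: operands come first, then the operator.
    Naive split on whitespace, so "(Hello world) Tj" would be mis-split.
    """
    annotated = []
    operands = []
    for token in stream.split():
        if token in OPERATORS:
            annotated.append((operands, token, OPERATORS[token]))
            operands = []
        else:
            operands.append(token)
    return annotated


example = "BT /F1 12 Tf 72 712 Td (Hello) Tj ET"
for operands, op, meaning in annotate(example):
    print(" ".join(operands + [op]).ljust(16), "#", meaning)
```

In a tree view, the description could be shown as a tooltip or a trailing comment next to each operator rather than as printed text.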

(Do consider making it an open-source project, whatever the quality of the code…)

[1]: https://github.com/shreevatsa/pdf-explorer / https://shreevatsa.net/pdf-explorer/

1 comments

Thanks for sharing your story! My goal was to have an MVP as fast as possible; otherwise, I could lose interest in it. That is the biggest reason why I chose to use an existing parser instead of writing my own (I've initialized an empty Rust project on my OS for that).

A few things that I have on my nice-to-do features list, but that are hard to implement without writing my own parser:

- edit nodes (with XREF table update)

- raw source editor

- show actual position in source
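To show why "edit nodes" drags the XREF table along with it, here is a small Python sketch, assuming a classic (non-stream) cross-reference table and generation 0 for every object. The function `build_pdf` is hypothetical and the output is only meant to illustrate the layout, not to be a fully validated PDF. Each in-use entry in the table is the 10-digit byte offset of an object, so changing the length of one object shifts the offsets of everything after it.

```python
def build_pdf(objects):
    """Serialize object bodies as objects 1..n and append an xref table.

    Returns (pdf_bytes, offsets), where offsets[i] is the byte offset
    of object i + 1 in the file.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # where "num 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    # Entry 0 is the head of the free list; every entry is exactly 20 bytes.
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets


catalog = b"<< /Type /Catalog /Pages 2 0 R >>"
pages = b"<< /Type /Pages /Kids [] /Count 0 >>"
pdf, offsets = build_pdf([catalog, pages])

# "Edit" object 1 by adding a key: object 2 moves, so its xref entry must change.
edited = b"<< /Type /Catalog /Pages 2 0 R /PageLayout /SinglePage >>"
pdf2, offsets2 = build_pdf([edited, pages])
assert offsets2[1] - offsets[1] == len(edited) - len(catalog)
```

An editor built on an existing parser has to either regenerate the whole table like this after every change, or append an incremental update section, which is part of why the feature is awkward without low-level access to the file.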

For editing, I was able to make some simple edits (not of individual objects, but things like removing or duplicating pages, or editing crop boxes) using pdf-lib instead of pdf.js: see for example (just right-click and "view source") https://shreevatsa.net/pdf-pages/ and https://shreevatsa.net/pdf-unspread/

For seeing the raw source, after using such things for a bit (e.g. the output HTML file generated by https://github.com/desgeeko/pdfsyntax which is very good), I'm starting to feel it's nice to look at the first few times / in some cases, but in the long run / for large PDFs, maybe it's not really so useful or worth it.