by svat 1308 days ago
This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec) and on browsing that structure. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: https://github.com/shreevatsa/pdf-explorer)

In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.


  Commiting a change (from Jun 5 20:57) that I don't understand any more 

     // From real life, lightly modified. Note the "/companyName, LLC" as key!
With absolutely no slight toward the author, that matches my mental model of dealing with PDFs: `git commit -mwtf`

I'm the author and I just meant I had left behind a small uncommitted diff back when I stopped working on it, and I didn't bother to read the diff before committing. I actually understand it just fine, on second look…

Overall, at least so far, I haven't actually encountered much "WTF" dealing with PDFs. The spec (especially the Adobe version: the ISO version based on it is only slightly different but feels much worse) is quite pleasant to read. There are some warts from backward compatibility with earlier poor decisions, but not too many of them. And while it's surprising what different PDF programs will produce, as long as any PDF reader in existence happens to accept it (Hyrum's law; e.g. in this example, the dictionary key having a space in it), for my purposes it hasn't been a big deal: I'm only trying to do the first level of parsing, and when even that is problematic I can happily just declare the PDF malformed.
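
To make the "space in a dictionary key" problem concrete, here is a rough sketch of first-level name tokenizing as I read the spec's rules: a name starts with `/` and runs until the next whitespace or delimiter byte, with `#xx` hex escapes decoded. This is Python for brevity, not the code from my repo; `read_name` and the sample bytes are made up for illustration.

```python
from typing import Tuple

# Whitespace and delimiter bytes as listed in the spec's "Syntax" chapter.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"
HEX_DIGITS = b"0123456789abcdefABCDEF"


def read_name(data: bytes, pos: int) -> Tuple[bytes, int]:
    """Read the name token starting at data[pos].

    Returns (decoded name without the leading '/', offset just past the token).
    Raises ValueError when the bytes cannot be a name (hypothetical helper).
    """
    if data[pos:pos + 1] != b"/":
        raise ValueError(f"malformed: expected '/' at offset {pos}")
    end = pos + 1
    # The token ends at the first whitespace or delimiter byte.
    while (end < len(data)
           and data[end] not in PDF_WHITESPACE
           and data[end] not in PDF_DELIMITERS):
        end += 1
    raw = data[pos + 1:end]
    name = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == ord("#"):
            # '#' must be followed by exactly two hex digits.
            digits = raw[i + 1:i + 3]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"malformed: bad #xx escape in name at offset {pos}")
            name.append(int(digits, 16))
            i += 3
        else:
            name.append(raw[i])
            i += 1
    return bytes(name), end


# Made-up input resembling the "/companyName, LLC" key mentioned above.
sample = b"<< /companyName, LLC (Acme) >>"
key, end = read_name(sample, 3)
```

With this strict reading the key stops at the space, so `key` is `companyName,` and the bare `LLC` that follows is not a valid object, which is the point where a first-level parser can declare the file malformed. A writer following the spec would have encoded the space as `#20`, i.e. `/companyName,#20LLC`.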