by JackC 246 days ago

Opinion from 10 years ago, I suspect still valid:

There are a million python libraries and tools to do some overlapping subset of the things you'd want to do with a pdf.

There are no doubt another million in other languages.

These are each basically bundles of some of the transformations you'd want to make to the same underlying data structure.

So, complex pdf scripts often need two or three different libraries to get their thing done, which is wasteful at both a dev effort and computational level.

The ecosystem would be greatly improved if someone made a great (probably rust based) in-memory low level pdf reading and writing data structure.

PDF libraries in any language could switch to using that structure and library internally, with the carrot that the switch would result in needing less code, and likely being some combination of faster and safer.

And then if they just exposed get_structure_pointer() and set_structure_pointer(), they could all interoperate for free. (Another carrot for joining -- small libraries could usefully add features and be adopted without needing to pick an existing popular library to glom onto.)

Not sure what would economically cause this to happen, but it would be great.
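The get_structure_pointer() / set_structure_pointer() idea above can be sketched roughly as follows. This is only an illustration in plain Python: `SharedPdfDoc`, `RotatorLib`, and `PageCounterLib` are hypothetical names, and a real shared structure would presumably live in a native library and be handed across language bindings rather than passed as a Python object.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SharedPdfDoc:
    """Hypothetical shared low-level structure: object number -> parsed object."""
    objects: Dict[int, Any] = field(default_factory=dict)


class RotatorLib:
    """Hypothetical library A: only knows how to rotate pages."""

    def __init__(self) -> None:
        self._doc = SharedPdfDoc()

    def get_structure_pointer(self) -> SharedPdfDoc:
        return self._doc

    def set_structure_pointer(self, doc: SharedPdfDoc) -> None:
        self._doc = doc

    def rotate_pages(self, degrees: int) -> None:
        # Mutates the shared structure in place; no serialize/reparse step.
        for obj in self._doc.objects.values():
            if isinstance(obj, dict) and obj.get("/Type") == "/Page":
                obj["/Rotate"] = (obj.get("/Rotate", 0) + degrees) % 360


class PageCounterLib:
    """Hypothetical library B: only knows how to count pages."""

    def __init__(self) -> None:
        self._doc = SharedPdfDoc()

    def get_structure_pointer(self) -> SharedPdfDoc:
        return self._doc

    def set_structure_pointer(self, doc: SharedPdfDoc) -> None:
        self._doc = doc

    def count_pages(self) -> int:
        return sum(
            1
            for obj in self._doc.objects.values()
            if isinstance(obj, dict) and obj.get("/Type") == "/Page"
        )


# Two unrelated libraries operate on the same in-memory document.
doc = SharedPdfDoc({
    1: {"/Type": "/Page"},
    2: {"/Type": "/Page", "/Rotate": 90},
    3: {"/Type": "/Catalog"},
})
rotator = RotatorLib()
rotator.set_structure_pointer(doc)
rotator.rotate_pages(90)

counter = PageCounterLib()
counter.set_structure_pointer(rotator.get_structure_pointer())
```

The point of the sketch is that neither library writes the file back out or parses it again to hand over to the other; they only exchange a reference to the same object graph.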

7 comments

layer8 246 days ago

When you write a PDF library, there are design trade-offs all the way down, depending on use cases. (Just “in-memory” is already an important design trade-off, because the PDF format is intentionally designed to not require the whole PDF to be loaded into memory at once.) It would also be antithetical to preferring deep modules with minimal interfaces over shallow modules with broad interfaces [0]. Lastly, in managed environments like the JVM, a C-interface library would come with additional complications and overheads.

[0] https://dev.to/gosukiwi/software-design-deep-modules-2on9

kccqzy 246 days ago

Ah that reminds me of the days when I was viewing a large PDF (some instruction manual that's hundreds of pages long) and the pages appear in the browser as soon as they are downloaded.

selcuka 246 days ago

Those are linearized PDFs [1]. Not all PDFs support streaming.

[1] https://developer.adobe.com/document-services/docs/overview/...
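As a rough illustration of how a tool might tell the two kinds of file apart: a linearized PDF starts with a linearization parameter dictionary containing a /Linearized key, and the PDF specification requires that dictionary to sit within the first 1024 bytes of the file. The helper below (`looks_linearized` is a made-up name) is only a heuristic sketch based on that rule; a real reader would also validate the dictionary, for example by checking its /L entry against the actual file length.

```python
def looks_linearized(head: bytes) -> bool:
    """Heuristic: True if a /Linearized key appears in the first 1024 bytes.

    `head` is the beginning of the file (passing more than 1024 bytes is fine;
    only the first 1024 are inspected).
    """
    return b"/Linearized" in head[:1024]


def file_looks_linearized(path: str) -> bool:
    """Convenience wrapper that reads only the start of the file."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```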

kmoser 246 days ago

> The ecosystem would be greatly improved if someone made a great (probably rust based) in-memory low level pdf reading and writing data structure.

> Not sure what would economically cause this to happen, but it would be great.

Writing a library that is better than all the others is difficult to begin with. Continuing to upgrade and maintain it and fix bugs is even more difficult. Even with the right funding, you'd have to find someone who wants to keep at it year after year. When they inevitably lose interest, you'd have to find somebody else to take the reins--and weather the storm of complaints during the down time.

In short, thank you for volunteering to write and maintain this library for the rest of your life! :)

conradev 246 days ago

> The ecosystem would be greatly improved if someone made a great (probably rust based) in-memory low level pdf reading and writing data structure.

https://github.com/J-F-Liu/lopdf

specialist 246 days ago

> someone made a great ... in-memory low level pdf reading and writing data structure

Are you suggesting Adobe's Core Object Application Programming Interface (COAPI) for PDF isn't sufficient?

Kidding!

I worked on print production software in the '90s. Stuff like image positioning (eg bookwork), trapping, color separations, etc. Adobe's SDKs, for both PostScript and PDF, were most turrible. For our greenfield product for packaging (printing boxes), I wrote a minimalist PDF library, supporting just the feature set we needed. So simple.

Of course, PDF is now an ever growing katamari style All The Things amalgamation of, oops, sorry I ran out of adjectives.

Back to your point: after URLs and HTTP, the DOM is the 3rd best thing spawned by "the web".

The DOM concept itself. Isomorphism between in-memory and serialized. That it's all just an object graph. Composition over inheritance.

Not the actual DOM API; gods no.

I understand that API design is wicked hard. But how is it that of the Java tools, only JDOM2 (the sequel) managed to get the class hierarchy correct? So that incorrect usage is not permitted?

(I haven't looked at popular libraries for other languages. I assume they all also fell into the trap of transliterating JavaScript's DOM's API. Like dom4j and successors did.)

I'm just repeating your point (I think) that Adobe should have staked a strong starting conceptual position on PDF internals, what a PDF is. Something more WinForms and less Win32.

30+ (?!) years later, I'm still flubbergasted by PDF's success, despite Adobe's stewardship.

PS- And another thing...

For a print description language, I greatly preferred HP's PCL-5. Emotionally, it just feels more honest somehow. Initially, Adobe couldn't decide if PDF was for print control or documents. Customers wanted documents, so Adobe grudgingly complied, haphazardly.

At least "the web" had/has committees.

mannyv 246 days ago

"Adobe couldn't decide if PDF was for print control or documents"

Apparently people don't understand the history of PDF. PDF was originally a way to encapsulate PostScript so you could display it on a screen. Unlike PCL, PostScript (and PDF) were device-independent, with a WYSIWYG guarantee. PostScript and PDF are literally the history of WYSIWYG on personal computers and computer-based printing/typesetting.

PDF is not "print control" in the sense of a job control language. PDF has always been about documents, and the features of PDF files can be seen as an attempt by Adobe to both drive and follow the market's evolution of document handling.

PDF is complicated because it's used widely for lots of different things, including printing. And if you've never worked in the printing industry you have no idea how much of a PITA it is.

PDF succeeded for a lot of reasons, but probably the easiest explanation is that PDFs were easier to create: you just printed the document and the PDF printer driver spat out a PDF file that you could share everywhere.

sleepybrett 246 days ago

One of my first jobs was at an isp/web/cohost company. We had a big bank of modems for dialup customers, had some customers who terminated isdn with us, a rack of colocation and built websites as well.

The company was partially owned by, and housed primarily in, a print shop. We worked above the press floor, and I was sometimes pressed into service helping when we were slow (I had some experience working in a print shop in high school, helping with PageMaker and helping to run the big Heidelberg, and similarly in college).

Nothing like ending your day writing Perl CGI scripts and troubleshooting customers' damn Winsock configurations and then going home and coughing up whatever color was running on the presses that day.

tingletech 246 days ago

I had an early job with an ISP that was similar, had modems in people's garages all over the county since this was when calling local could get expensive. The ISP was in the back of a computer store though. Once an ISP customer came into the store. I was just answering phones in the back room, but they sent me to the floor to talk to the customer. I was wearing sandals, and the sales manager fired me on the spot for being on his floor with sandals. The person who I really reported to tried to hire me back when he found out that sales manager had sent me home and fired me.

whizzter 246 days ago

Actually debugging a PDF parsing issue as we speak and actually started writing a parser (partially to understand the issue, partially as a last resort as the code in the parser I was debugging felt a bit shoddy).

The PDF format is frankly quite horrible, extended over the years by kludges that feel more or less like premature optimizations in some cases and bloated overkill in others.

While theoretically a nice idea, the issue is that there are just so many damn object types with specialized properties inside a PDF that you'd basically end up with all the complications of an FFI for each binding you'd do to expose a sane subset.

Theoretically one could perhaps make a canonical PDF<->JSON or similar mapping from an established library that most PDF data consumers/generators could use if memory usage isn't too constrained (because the underlying object model isn't entirely dissimilar).
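A minimal sketch of what such a mapping could look like, assuming a made-up tagging convention (`$name`, `$ref`, `$dict`) and hypothetical `Name` and `Ref` types; this is not the format of any existing tool. PDF booleans, numbers, text strings, arrays, and null map straight onto JSON; names and indirect references need a tag so they survive the round trip. Streams and binary strings are left out here.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Name:
    """A PDF name object, e.g. Name("Page") for /Page."""
    value: str


@dataclass(frozen=True)
class Ref:
    """An indirect reference, e.g. Ref(2, 0) for "2 0 R"."""
    num: int
    gen: int = 0


def to_json(obj: Any) -> Any:
    """Convert a PDF object tree into JSON-serializable values."""
    if isinstance(obj, Name):
        return {"$name": obj.value}
    if isinstance(obj, Ref):
        return {"$ref": [obj.num, obj.gen]}
    if isinstance(obj, dict):
        # Wrapped in "$dict" so real keys can never collide with the tags.
        return {"$dict": {k: to_json(v) for k, v in obj.items()}}
    if isinstance(obj, list):
        return [to_json(v) for v in obj]
    return obj  # bool, int, float, str, None map directly


def from_json(data: Any) -> Any:
    """Inverse of to_json."""
    if isinstance(data, dict):
        if "$name" in data:
            return Name(data["$name"])
        if "$ref" in data:
            num, gen = data["$ref"]
            return Ref(num, gen)
        return {k: from_json(v) for k, v in data["$dict"].items()}
    if isinstance(data, list):
        return [from_json(v) for v in data]
    return data


page = {
    "Type": Name("Page"),
    "Parent": Ref(2),
    "MediaBox": [0, 0, 612, 792],
    "Rotate": 90,
}
text = json.dumps(to_json(page))
restored = from_json(json.loads(text))
```

Here `restored` compares equal to `page`, which is the property a canonical mapping would need: nothing about the object graph is lost by going through JSON.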

whenc 246 days ago

You can do:

  cpdf -output-json in.pdf -o out.json

(Modify out.json as liked)

  cpdf -j out.json -o out.pdf

(Disclaimer, I wrote it.)

whizzter 244 days ago

Seems cool for document usage. The online JS version, however, thrashed the digital signatures with that rotate 10 degrees demo (not entirely sure if it was just a checksum issue, but it seemed to be worse, as in tinkering with or not round-tripping the signature data object).

zehaeva 246 days ago

I don't think this _really_ contributes to the conversation, but I think we can sum this entire post up with just one XKCD comic.

https://xkcd.com/927/

wmichelin 246 days ago

ffmpeg but for PDFs
