Hacker News
by alister 2064 days ago
I'm thankful PDF won, because otherwise I think it would have been Microsoft Word. There was a time when papers, books, resumes, contracts, etc. almost always came as Word. Does anyone else remember getting a book as preface.doc, chap1.doc, chap1a.doc, chap2.doc, subchap2a2.doc, and so on, plus a mess of jpegs and gifs, and trying to figure out how it had to be assembled, and discovering something was missing, or that one chapter was newer than the others? That's one reason I really like PDF -- it's one file, self-contained, and linear.

On the other hand, I really wish it was more diff'able. If for example a credit card company changes one word in their terms & conditions PDF, it seems like 90% of document changes at the binary level. I know that PDF diff tools exist, but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure.

10 comments

PDF objects within the file are usually compressed. That means if anything changes, the whole compressed binary blob changes.
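
To make the compression point concrete, here is a minimal sketch using Python's `zlib` (the same Deflate family that PDF's Flate-encoded streams use). The page text is made up for illustration, and a real PDF stream would hold drawing operators rather than plain sentences.

```python
import zlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes two blobs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical stream contents; exactly one word differs between versions.
before = "The annual fee is waived for the first year. " * 20
after = before.replace("waived", "charged", 1)

blob_a = zlib.compress(before.encode("ascii"))
blob_b = zlib.compress(after.encode("ascii"))

shared = common_prefix_len(blob_a, blob_b)
print(f"compressed blobs share {shared} of {len(blob_a)} leading bytes")
```

A byte-level diff tool sees two different blobs here even though the uncompressed inputs differ by a single word.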

Other than compression and such encodings, PDF files are actually text files, with a drawing model largely based on PostScript but without the programming. If you want to diff them, use `mutool clean -d -a` to first turn them into pure ASCII text.
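
A rough sketch of how that could be scripted, assuming `mutool` is installed and on the PATH. The function names and file names are made up for illustration; only the `mutool clean -d -a` invocation comes from the comment above.

```python
import difflib
import subprocess

def mutool_clean_cmd(src: str, dst: str) -> list[str]:
    """Command line for the invocation quoted above: -d decompresses streams,
    -a ASCII-encodes binary streams."""
    return ["mutool", "clean", "-d", "-a", src, dst]

def normalize(src: str, dst: str) -> str:
    """Run mutool (assumed to be installed) and return the cleaned file as text."""
    subprocess.run(mutool_clean_cmd(src, dst), check=True)
    with open(dst, encoding="latin-1") as fh:
        return fh.read()

def text_diff(a: str, b: str, a_name: str = "old.pdf", b_name: str = "new.pdf") -> str:
    """Unified diff of two already-normalized PDF texts."""
    lines = difflib.unified_diff(
        a.splitlines(), b.splitlines(), a_name, b_name, lineterm=""
    )
    return "\n".join(lines)
```

Usage would be along the lines of `print(text_diff(normalize("old.pdf", "old.txt.pdf"), normalize("new.pdf", "new.txt.pdf")))`. Expect noise from object numbers and offsets in addition to the real change.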

That said, since it's a "baked" layout format, if one word pushes the rest of the text forward, everything after that will show up with changed coordinates. It's closer to a vector image format like SVG than a markup format like HTML or ODF.

There are also things like font subsetting, where removing a word that was the only use of a character, or adding a word that uses a new character, might change the font data to add/remove those characters.
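
A toy illustration of the subsetting effect, assuming a subset font simply carries one glyph per distinct non-whitespace character. Real subsetters work on glyph IDs and may keep extra glyphs, so treat this as a simplification.

```python
def subset_glyphs(text: str) -> set[str]:
    """Characters a (simplified) subset font must contain to render `text`."""
    return {ch for ch in text if not ch.isspace()}

before = subset_glyphs("pay on time")
after = subset_glyphs("pay on time, quickly")

# Glyphs that must be added to the embedded font after the edit.
added = sorted(after - before)
print(added)  # [',', 'c', 'k', 'l', 'q', 'u']
```

Adding one word pulls six new glyphs into the embedded font data, which is a change far away from the text itself.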

"but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure."

Imagine a simple file format that doesn't support text wrapping, but allows you to specify elements as (x, y, s) where (x, y) specify a position, and s is a string that will be written left-to-right, truncated at the edge of the screen.

That's a simple file format, right?

But inserting a word somewhere early in that document would change the string within every element in the rest of the paragraph. And maybe move the y position of every element later in the document.

That would be a PITA to diff. Even more so if the document has more than one column.
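
The thought experiment above can be sketched in a few lines. This assumes a greedy word wrap at a fixed character width, which is my own stand-in for whatever the producing application does; the sample sentence is made up.

```python
from itertools import zip_longest

def layout(text: str, width: int = 20, line_height: int = 12):
    """Greedy-wrap words into lines and return baked (x, y, s) elements."""
    elements, line, y = [], "", 0
    for word in text.split():
        candidate = word if not line else line + " " + word
        if len(candidate) > width and line:
            elements.append((0, y, line))  # line is full: bake it
            y += line_height
            line = word
        else:
            line = candidate
    if line:
        elements.append((0, y, line))
    return elements

def changed_elements(a, b) -> int:
    """Count positions where the two element lists differ."""
    return sum(1 for x, y in zip_longest(a, b) if x != y)

before = layout("the quick brown fox jumps over the lazy dog and keeps running far away")
after = layout("the very quick brown fox jumps over the lazy dog and keeps running far away")
print(changed_elements(before, after), "of", len(before), "elements changed")
```

Inserting the single word "very" changes three of the four elements here; only the last line happens to re-synchronize.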

The irony is when I get told that people want an application to output PDF instead of Word, because it is read only.

I always get amused by showing those people how to edit PDFs.

It is the same logic by which documents sent by fax are legally binding but the same document sent by email is not.

As someone that does this, it’s because they are harder to change for the average user, and that barrier also means ‘don’t do it’ in a soft sense. If I send a PPT you are almost saying to a client “you can edit this if you want, because I provided it in a format which is designed for editing rather than a format that was designed for view-only”.

99% of the time it stops the “Oh great, so when you took that presentation we prepared for you as a consultancy, you kept our logo on it but changed the content and also removed our caveats!”

Also it stops people seeing my personal notes and comments that I have included throughout the document if it’s a ppt, and it also stops me sharing the data behind any graphs which is normally internally stored in the ppt file. You can set a ppt as view only, but nobody does it and clients don’t like it.

Yes - the dangers of sending a PPT to someone are real... your theme, your logo, someone else’s words. Unless a person is set up specifically to edit PDFs, there are limited tools to do so. Change too many words and spacing and alignment get off. And it’s annoying.

The right way to do this is via digitally signed PDFs. The signature is invalidated if the document is edited (other than adding a signature).
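
To illustrate only the "any edit invalidates it" property: the sketch below uses a keyed SHA-256 digest (HMAC) from the Python standard library as a stand-in. This is not how PDF signatures are actually built (those use public-key signatures over byte ranges of the file), and the key and document bytes are made up.

```python
import hashlib
import hmac

def sign(document: bytes, key: bytes) -> bytes:
    """Stand-in for a signature: a keyed SHA-256 digest over the exact bytes."""
    return hmac.new(key, document, hashlib.sha256).digest()

def verify(document: bytes, key: bytes, signature: bytes) -> bool:
    """True only if the bytes are exactly those that were signed."""
    return hmac.compare_digest(sign(document, key), signature)

key = b"hypothetical-secret"
original = b"Caveat: figures are estimates. Annual fee applies."
signature = sign(original, key)

tampered = original.replace(b"estimates", b"guaranteed")
print(verify(original, key, signature), verify(tampered, key, signature))
```

The original verifies and the tampered copy does not, which is the behavior a viewer reports when a signed PDF has been altered.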

Disclaimer: I work for Adobe, but not directly on Document Cloud.

Wait until you get a PDF which is just a bunch of poorly scanned JPGs and no OCR.

That is why one pays for Adobe Acrobat.

> I'm thankful PDF won, because otherwise I think it would have been Microsoft Word.

Well, probably Microsoft XPS, which was actually a fairly well designed format. But Microsoft didn't have the fight in them to really push it as a competitor to PDF. In part, I suspect b/c it's hard to justify investing a lot of money in your competing document standard as there is not much revenue you can derive from it. As of 2018, Microsoft no longer bundles XPS support in Windows 10.

XPS happened during the Microsoft era, which meant no one really wanted another format dictated by them. So there was very little incentive, interest, or adoption.

XPS ultimately became an open standard as the Open XML Paper Specification. But the fear and burn from the IE era were far too great.

Interestingly, it still exists inside the print spooler; it's the default spool format for modern printer drivers.

They also had RTF, which was one of the best formats created by Microsoft.

I don't have an opinion on how good a format RTF is, but I kind of like it.

As part of a Java project (a while ago), I studied the RTF format, partly by reading the spec, and partly through reverse engineering - by creating multiple incrementally larger RTF docs, starting from zero content, then adding a word, then a font style, then a paragraph, a table, etc. And after each addition, opened the RTF in a hex editor and viewed the content, to help decipher the format rules. Then wrote a small RTF generation library in Java, that we used in the project to programmatically generate reports from DB data fetched via EJB. I also provided some ability to vary content and style independently. Good fun.
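
For a flavor of what such a generator involves, here is a minimal sketch in Python (not the Java library described above, which I have not seen). It assumes only a handful of basic RTF control words: `\rtf1`, `\ansi`, `\deff0`, `\fonttbl`, `\fs`, `\par`, and `\uN` for non-ASCII characters. The function names are made up.

```python
def rtf_escape(text: str) -> str:
    """Escape RTF's special characters and encode non-ASCII as \\uN? escapes."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\\{}":
            out.append("\\" + ch)          # backslash and braces are control chars
        elif code < 128:
            out.append(ch)
        elif code <= 0xFFFF:
            if code > 32767:               # \uN takes a signed 16-bit decimal
                code -= 65536
            out.append("\\u%d?" % code)    # "?" is the fallback for old readers
        else:
            raise ValueError("characters outside the BMP are not handled here")
    return "".join(out)

def rtf_document(paragraphs, font: str = "Times New Roman", size_pt: int = 12) -> str:
    """Wrap escaped paragraphs in a minimal header; the font size is in half-points."""
    body = "".join("%s\\par\n" % rtf_escape(p) for p in paragraphs)
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 %s;}}\\f0\\fs%d\n" % (font, size_pt * 2)
    return header + body + "}"

print(rtf_document(["Quarterly report", "Total: 42 {units}"]))
```

The output is plain ASCII, so it can be read in any text editor.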

Just out of curiosity, why the hex editor? Isn’t RTF just ASCII?

You are right, it is just ASCII text. Probably a brain fart there, sorry. I may have said that (hex editor) out of sheer habit of using one to inspect various formats. Or I may have used a text editor, if so TextPad, IIRC, since the project was on Windows (at least dev env was). It was years ago, so not sure.

Aight :)

In what ways is it preferable to ODT (OpenDocument format)?

sidpatil answered (sibling comment). The RTF spec was freely available (on MS's site and/or MSDN CDs then) and should still be around, at least on some sites, since RTF is still used a lot as an exchange format between word processors and even other software. So you can read the spec; it is straightforward.

RTF is to Word like CSV is to Excel. In fact, we generated RTF because the output was to be input to Adobe InDesign.

RTF is a much simpler format than ODT. RTF source code resembles TeX at first glance; ODT is based on XML.

Unfortunately, it's not an open standard AFAICT.

It's not a standard in the sense of an ISO-style organization approval, but RTF has been thoroughly documented by Microsoft for a very long time. [0]

[0] https://interoperability.blob.core.windows.net/files/Archive...

Right, I think it still may be a de facto standard from MS. It was so then.

The main thing for me is I don't need the originating application or fonts. I have twenty year old PDF files created by some long gone software that I can still read.

PDFs act more like images than text. I made a tool for diffing PDFs at the visual level a little while ago (http://parepdf.com) because I needed a way to see the explicit differences between PDFs.

Diffing PDFs at the textual level is a much harder problem though since lines of text need to be reordered and concatenated with each other. Unfortunately there is nothing built into the format that allows you to know what line belongs with what other line beyond guesswork.
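
The kind of guesswork involved can be sketched like this, assuming text fragments have already been extracted as (x, y, text) tuples with y growing downward. The tolerance value and the fragments are invented for the example.

```python
def group_into_lines(fragments, y_tolerance: float = 2.0):
    """Group (x, y, text) fragments into lines by similar y, then order by x.

    Heuristic only: fragments whose y differs by at most `y_tolerance` from
    the first fragment of the current line are treated as the same line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]

# Fragments in scrambled drawing order, with slightly uneven baselines.
fragments = [
    (120, 100.0, "world"),
    (10, 130.0, "second"),
    (10, 100.5, "Hello"),
    (80, 130.4, "line"),
]
print(group_into_lines(fragments))
```

This works for a single column of tidy text. With two columns it would happily glue the left and right column into one line, which is exactly the failure mode that makes textual diffs unreliable.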

That's a very nice tool!

I attempted something similar (https://nicediff.com), and found the textual approach to be basically useless:

Tax form example: https://www.nicediff.com/view/7a5f41ba3c76ae9bb45f42a4faa8b6...

We'd probably be using PostScript and maybe later XPS. Word never had a print-oriented format with exact layout.

I don't know of a better alternative to PDF that was around at the time, but I can't say I'm a fan. It undeniably works well as a way of placing pixels precisely on a page but then so does PNG, and PNG is far simpler and compresses better for computer generated content.

Sadly some information I only get as PDFs, so I have to scrape them. Easy, right? It can be, if the PDF is structured sanely. But PDF isn't some well defined data structure for laying out the page, it's a Turing complete stack-based computer program that can do whatever it damned well pleases. The font tables don't necessarily have ' '=32, 'A'=65, 'a'=97. Why not optimise it and get rid of all those gaps, so now ' '=0, 'A'=30? And it doesn't have to be drawn in any sane order. It can be just a mess that makes even copy & paste near impossible, and some are.

Did we really need to invent a DSL that has to be executed every time we wanted to view a page? I remember it being pushed as a cool solution at the time. It doesn't look so cool now. SVG would be an improvement.
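
The font-table remapping described above can be imitated in a few lines. This is a toy model that assigns codes in order of first use; real producers pick codes however they like, and a real PDF needs a mapping back to Unicode (when one is embedded at all) for extraction to work.

```python
def build_compact_encoding(text: str) -> dict[str, int]:
    """Assign codes 0, 1, 2, ... to characters in order of first use (no gaps)."""
    encoding: dict[str, int] = {}
    for ch in text:
        encoding.setdefault(ch, len(encoding))
    return encoding

def encode(text: str, encoding: dict[str, int]) -> bytes:
    """Bytes as they would appear in the page's text-drawing operators."""
    return bytes(encoding[ch] for ch in text)

def decode(codes: bytes, encoding: dict[str, int]) -> str:
    """Recover the text; only possible if the mapping is available."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[c] for c in codes)

text = "A cab"
enc = build_compact_encoding(text)
raw = encode(text, enc)
print(list(raw), repr(raw.decode("latin-1")), decode(raw, enc))
```

Here `raw` is the byte sequence 0 to 4, which is gibberish if read as ASCII. A scraper that lacks the reverse mapping has nothing to go on except the glyph shapes.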

PNG doesn't support multiple pages and didn't supplant GIF until 2000 or so. TIFF does, but in practice it's always uncompressed (did it even support compression in the 90s?). Neither solution allowed for text blocks or vector zooming or form fields.

It's not difficult to improve upon a sane subset of PDF, but that would require backing and coordination. Reviving XPS (but not under MS auspices) should also be possible.

PNG is of course an image format, and that means it doesn't really do text well. (Oh, and PDF is fully Turing complete and can even execute JavaScript, so in some contexts calling it a DSL is straining the definition a bit.)

> If for example a credit card company changes one word in their terms & conditions PDF, it seems like 90% of document changes at the binary level.

Convert to text with: pdftotext -layout

> because otherwise I think it would have been Microsoft Word.

No, those are formats with completely different scopes. They don't compete and are essentially non-interchangeable.

> There was a time when papers, books, resumes, contracts, etc. almost always came as Word.

There was never such a time. I mean, sure, you could (and can) send people Word/LibreOffice documents, but things that needed some reproducibility and finality [1] were almost never distributed or published in MS Word format. PostScript used to be pretty popular though.

[1] - Yes, PDFs can be edited too, I know.