Ask HN: Why are PDFs still heavily used?
19 points by pranshum 1205 days ago
PDFs are terrible to view on mobile. They have accessibility issues, navigation is painful, the content is static, and so on [1].

Web standards fix many of the problems. So why are PDFs still used, particularly for publishing reports?

1: https://www.nngroup.com/articles/pdf-unfit-for-human-consump...

24 comments

> the content is static

Why are you saying this like it's a bad thing? Why would you want dynamic content in a document format?

You can't give someone an epub and tell them to read page 197. You can't look at a webpage and know how it will look when you print it out. You also can't just netcat an ebook to your printer and get useful output.

What alternatives do we really have? PDF is the most widely supported format outside of plain text (arguably more widely supported than text actually, with phones being the major computing device). I haven't seen an ebook format that can correctly and consistently deal with code sections, inline images, quotes, font styling, paging, etc (maybe more a software issue than format problem, but if nobody can make it work right, the format isn't blameless).

PDF should be replaced... By a document format that displays statically rendered content in page size chunks, but with some of the backend problems removed.

Stuff like embedding text in a way that improves accessibility and makes copying text work nicely. Being able to right-click and copy images would also be nice.

Improving accessibility may be fair (I'm not familiar with the details), but in many cases PDFs are used for scenarios where the authors don't want the individual assets to be trivially extractable. Even if an advanced user can do so programmatically, a static layout is often a feature, not a bug. If the authors wanted the reader to be able to quickly pull out assets and recreate it they would distribute the content in some other format.
That's dumb, and pointless. Any image or text that is displayed on the screen will always be easily copied. There's no security in making copying an image take five clicks instead of two, it's just annoying. Not annoying because the five clicks are difficult or very time consuming, annoying because "why should I have to click 3 extra times just because someone foolishly thinks their content is being protected when it isn't". The analog hole is always there as a fallback too.

If someone wants something to be private, they shouldn't publish it. Once you put something out into the world, you should expect that people will do whatever they want with it.

Read-Only Docx solves pretty much all these issues.
Not really. Even Word for Windows and Word for Mac don't always render them identically, and in Office 365 it may be something else! And when you add LibreOffice Writer and the like into the equation, all bets are off.
> You also can't just netcat an ebook to your printer and get useful output.

Can you 'just netcat' a PDF to a printer? Which port number? How could I control features such as printing to both sides of the page? I imagined there would be some wrapping protocol (or conversion to postscript?) Does the printer that receives just a PDF binary, and no other control signal, just decide to use a default mode (like always printing single sided only)?

Yes, it works on most printers. My Brother laser printer has a menu option to print out its network settings, which gives ~20 different protocols with ports which can be used. I don't think you can control the print settings with netcat, but I'm probably wrong. Mostly I use lpr, but the netcat trick is very useful sometimes.
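
On the duplex question: going through lpr instead of raw netcat is one way to pass job options along with the file. A minimal sketch, assuming a CUPS-style lpr is installed. The queue name `office_laser` is hypothetical, and `sides=two-sided-long-edge` is the usual CUPS duplex option, which a particular printer or driver may or may not honour.

```python
import subprocess
from typing import List, Optional


def build_lpr_command(path: str, printer: Optional[str] = None,
                      duplex: bool = False) -> List[str]:
    """Build an lpr argument list for one file (assumes a CUPS-style lpr)."""
    cmd = ["lpr"]
    if printer:
        # -P selects the destination queue; the queue name is site-specific.
        cmd += ["-P", printer]
    if duplex:
        # Standard CUPS job option for double-sided, long-edge binding.
        cmd += ["-o", "sides=two-sided-long-edge"]
    cmd.append(path)
    return cmd


def print_file(path: str, printer: Optional[str] = None,
               duplex: bool = False) -> None:
    """Hand the file to lpr; raises CalledProcessError if lpr fails."""
    subprocess.run(build_lpr_command(path, printer, duplex), check=True)


# Example (hypothetical queue name):
# print_file("something.pdf", printer="office_laser", duplex=True)
```

Keeping the command builder separate from the call that runs it makes it easy to check what would be sent before anything hits the printer.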
Fascinating, I printed a network report from an HP printer and it listed several services, one of which was:

    9100 Printing = Enabled
So I guessed that I could netcat to port 9100

    cat something.pdf | nc 192.168.1.2 9100
It worked! The printer sat waiting for more content, so I had to press Ctrl-C to kill netcat, but then it immediately printed what it had sent. The margins were off (not like if I had printed through a print driver).
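
The Ctrl-C step suggests the printer only starts the job once the connection closes. A rough Python equivalent of the netcat line that closes the connection itself after sending, as a sketch only: the address 192.168.1.2 and port 9100 come from the comment above, and whether a given printer treats the close as "end of job" is an assumption.

```python
import socket


def send_raw_job(data: bytes, host: str, port: int = 9100,
                 timeout: float = 10.0) -> int:
    """Send bytes to a raw print port, then close the connection.

    Returns the number of bytes sent. No print settings are transmitted;
    the printer falls back to whatever its defaults are.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(data)
        # Half-close the write side so the peer sees end-of-stream.
        # Assumption: the printer treats this as the end of the job.
        conn.shutdown(socket.SHUT_WR)
    return len(data)


# Example, using the address from the comment above:
# with open("something.pdf", "rb") as f:
#     send_raw_job(f.read(), "192.168.1.2")
```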
You can if it supports PDF, which lots of modern networked printers do.
New and improved link rot: now updated to include document contents.
The consistency of the format. PDFs are pretty much the only realistic way to convey items like legal documents to a wide range of customers. I certainly wouldn't want to receive my legally-binding bank account documents as a pile of HTML soup that would render arbitrarily (or not correctly at all).

I also have a rule about reading PDFs. I almost universally print them off to actual paper if it's more than 2-3 pages worth of content. Printing websites is a non-starter.

Personally I love reading PDF files on my iPad (or cheap Android tablet for that matter.) I can load 100 books and even more technical papers and read them on the bus, spinning at the gym, in bed, curled up on the couch, as a passenger in my car, etc.

Posting papers from arXiv to HN I have a choice of pointing to the HTML abstract or the PDF and overall people express preference for the PDF and I believe I get more upvotes for the PDF.

I can see a phone being a little too small though. PDF does have features that make it possible to make a PDF which is reflowable like an HTML document for the sake of accessibility and better UX, but those features are complicated as hell and rarely used.

_all_ PDF features are complicated as hell apart from the "Export to PDF" feature of your word processor :D
My #1 complaint is that for all of the complexity there is no primitive to draw a circle in PDF, instead any ‘circle’ you see in a PDF is an approximation based on Bézier curves. (I thought I was uniquely untalented because I couldn’t draw anime characters with Bézier curves but when I looked at the resource packs for my favorite game I found my favorite illustrator couldn’t do it either.)
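
For anyone curious how close the approximation gets: the usual trick is four cubic Bézier segments, one per quarter circle, with the control points offset by 4/3 * (sqrt(2) - 1), about 0.5523 times the radius. The sketch below measures how far one such segment strays from a true circle. The function names are made up for illustration, and this shows the common construction, not necessarily what any particular PDF producer emits.

```python
import math

# Control-point offset for a quarter circle, as a fraction of the radius.
KAPPA = 4.0 / 3.0 * (math.sqrt(2.0) - 1.0)


def quarter_arc(r: float):
    """Control points of a cubic Bezier approximating the arc (r,0) to (0,r)."""
    k = KAPPA * r
    return [(r, 0.0), (r, k), (k, r), (0.0, r)]


def bezier_point(p, t: float):
    """Evaluate a cubic Bezier with control points p at parameter t in [0, 1]."""
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    x = sum(w * pt[0] for w, pt in zip(weights, p))
    y = sum(w * pt[1] for w, pt in zip(weights, p))
    return x, y


def max_radial_error(r: float = 1.0, samples: int = 1000) -> float:
    """Largest distance between the sampled curve and the true circle."""
    pts = quarter_arc(r)
    return max(
        abs(math.hypot(*bezier_point(pts, i / samples)) - r)
        for i in range(samples + 1)
    )


# Example:
# print(max_radial_error())
```

The error comes out at a few hundredths of a percent of the radius, which is why nobody notices on screen or paper, but it is still not a circle.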

I did a deep dive into PDF because I was pitching and sketching out the ultimate text extraction system for corporate use, and figured PDF was the most important format to support. Boy, is it tricky to get the correct text out with the structure properly mapped unless some of those obscure features are in use, and even then it would still be pretty hard.
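
To give a flavour of why it is tricky: a typical PDF page stores runs of text at coordinates, not lines or paragraphs, so an extractor has to guess the reading order. The sketch below is the simplest version of that guess, grouping fragments by vertical position. The `(x, y, text)` tuples are a hypothetical extractor output, and real documents (columns, tables, rotated text) break this kind of heuristic quickly.

```python
from typing import List, Tuple

# (x, y, text): hypothetical output of some lower-level PDF text extractor.
Fragment = Tuple[float, float, str]


def rebuild_lines(fragments: List[Fragment],
                  y_tolerance: float = 2.0) -> List[str]:
    """Group positioned text fragments into lines, top to bottom."""
    lines: List[List[Fragment]] = []
    # PDF y coordinates grow upward, so sort descending to read top-down.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same line if the baseline is within tolerance of the line's first fragment.
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right and join with spaces.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for line in lines
    ]


# Example with made-up coordinates:
# rebuild_lines([(120.0, 700.0, "world"), (72.0, 700.5, "Hello")])
```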

We tried to extract pseudo-code test procedures out of an ISO Standard and ended up developing heuristics of how to piece the text together based on the color of the syntax highlighting. Terrible, terrible format.

At the moment I'd like to add OCR text overlays onto scanned documents and just can't get myself to deal with the format for longer than ten minutes before my brain gives up.

So, it seems a little strange now but when everything was a MS Word Document you had problems with the format. It would look amazing on your machine but if someone else opened it in a version higher or lower than yours or on a different machine things could look really funky and bad.

Thus, PDF was a game changer. Basically, no matter what machine you opened the file on it would look exactly the same. It kinda sucks you can't edit it or have to use special tools to parse it, but when it comes to readability across devices it can't be beat.

Almost all of the complaints in the linked document boil down to "PDFs are not websites," which is ironic, considering that is the main benefit to PDFs.

A bunch of them are just dumb assumptions, like vague complaints about accessibility, ignoring the fact that PDFs have supported screen readers, alt text, etc for longer than the web has.

I laughed out loud at the assertion that PDFs are "stuffed with fluff" and the web is a model of focused, concise writing.

"Jarring user experience" and "Cause disorientation" are the same complaint rewritten to highlight different aspects of one problem.

Web sites and PDFs are different tools serving different needs. The complaint that PDF content is "static" is, confoundingly, both untrue and entirely the point. PDFs have supported JavaScript, media embeds, advanced navigation tools (hyperlinks, cross-refs), and more, for years and years... but most people don't encounter much beyond form fills and hyperlinks, because the rest of that stuff is not what people want out of a document format.

In other words, all the consultants can write all the words about how PDF should be the web, but the market has spoken, and PDFs are the way they are because they meet real needs. Figuring out what those are might be a better use of time than shouting into the wind.

Now that I think about it, why should I have to throw away thirty years of PDF-compatible software because nobody bothered to make a decent PDF reader for your phone?

I have a love-hate relationship with PDFs; the hate mostly comes from parsing them. However, I think they are often the best thing for the job. A PDF is totally portable, your images and fonts are included in the file (I know you can do that with HTML but it is not how it is typically done), everyone can open them on every device, and you know when they do they will see exactly what you do, even if it may have a bit of an awkward aspect ratio. You also know it will look good printed out, which is a huge pain to achieve using web standards. Most programs used by normal people (Word/Google Docs, PowerPoint/Slides) have an intuitive export-to-PDF flow, while a web page export is usually more complicated and dissimilar to the document they originally created. There are whole categories of questions you need to answer about your formatting using web standards that are just not even a consideration in PDFs, and most people don't want to need to think about that just to share a document.
In my experience PDFs are mostly used to ensure what you're seeing is what everyone else is seeing. If I sent out a Word or Powerpoint doc, I could be 90% sure someone would have a formatting issue. Maybe their default margins are different or they don't have the same fonts installed. It's a better format for the reader when they don't require edit access.

Google Docs + MS Office has probably figured this out by now, but there's also a ton of historical momentum keeping it in use.

As a financial auditor I remember being baffled how people would go out of their way to create paper processes. A spreadsheet that requires confirmation would be printed, signed, stapled to another piece of paper - all so I can remove the staples and scan it to track digitally in our audit software. At least I knew sending them a PDF would come back looking the same...

If you can’t list their numerous advantages, I don’t think you should be asking for their removal.

https://en.m.wiktionary.org/wiki/Chesterton%27s_fence

PDFs are the only choice for portable read-only information. You click a button in Word and it exports with 100% fidelity exactly what you had on screen. Open it on a phone, or tablet, it retains its original size.

Web standards may have caught up, but nobody is sending anybody .mhtml files. I'm pretty sure most email providers would flag emails with such attachments as malicious.

Because most businesses/governments are solidly stuck in the "sheet of paper" paradigm, and PDF provides a way to "make electronic" all those sheets of paper.

And once the old paper-based process has been "made electronic" by using PDFs, they don't bother with trying to continue past a "sheet of paper" metaphor.

Those sheets of paper have a really long history and humans have evolved them to be just the right size ... to handle and hold. And ergonomically two columns on a sheet of A4 (or Letter, 8.5 by 11 in) is a pretty good default for the human visual system.

And ... printers work pretty well with it and while even I have to admit that digital documents are superior for most documents, paper still has a lot going for it: underlining, flipping back and forth, taking notes in the margins. I also find the added spatial dimension sometimes helps me remember (e.g. that was up on the top left of the page next to a chart) ...

The physicality of paper makes it substantially easier for my brain to map and relate content on the page. Each time a digital scroll or zoom is involved, my mental state machine experiences some corruption.
Wonder if there are any IDEs for book reading...
Paper is useful for the individual but using it in an organization context certainly causes problems.
- The content is normally static.

- They render consistently on any mainstream platform, without access to external resources.

- They load, scroll, and zoom relatively quickly on mainstream platforms.

I don't think PDFs are terrible to view on mobile. I think it works out pretty well. I prefer viewing them to the mess that is an HTML page with CSS. Usually whoever authored it didn't do a good job of making it pinch/zoomable and accessible.

In the contract-negotiation world, signature versions of contracts are pretty much always Word documents that are saved to PDF and then "signed" electronically — a PDF document has more of a "feel" of immutability, even though altering a PDF is trivial for anyone with Acrobat Pro or equivalent, of course.
PDF content is consistent across devices.

The lack of changing on mobile is technically a feature.

It's the only format that I can know with relative certainty will show the same visuals across many devices and platforms. That's very important.

Word/excel/powerpoint all do whatever they want.

I think the answer is that PDFs are used to convey 90%+ static content offline in a way where you could open up a 10-year-old PDF and it will still look exactly like it did the day it was created. Because of this, it is the system of record for reports in business, government, academia, and other organizations.

The other use case is fillable forms that are legally binding. In larger businesses or the government, where you have your identity tied to a private key, you can sign a PDF with it.

PDFs are easier to view on mobile than most websites containing the same range of content. Basically any graphic on a website is poorly displayed on mobile.
this is such a HN question...
MS Word's formats are worse.
Because of the power of metaphor.

A PDF is, in the popular consciousness, a digital form of paper. Anything that could be on paper? PDF.

Think of it as an FFI for the paper-based OS that our civ still runs upon (and may always.)

Like, network-based civ is kind of like a VM, and numerated paper documents (laws, etc.) are libc, and when you want to call out to the underlying reality, you use PDF.

It's a print format. They're meant to replicate the printed output.

Sometimes I wonder how in touch Nielsen is with the real world.

Jakob Nielsen has never been in touch with the real world.

For someone who talks about accessibility and usability, his personal website was always an absolute shitemare to read.

Printing to letter size paper is also not the only output.

Where I work we routinely send PDFs to print houses. They're large format, as in paper size. Honestly I can't imagine sending an html/css bundle to a factory. How would you even specify different varnish layers, or embossed areas?

Exactly! The alternative is a piece of paper, not "web standards".
What would you replace them with?

You can't use HTML, because it's totally inconsistent. You can't use Word docs, because they're totally inconsistent and not cross-platform.

What format can you think of that you can use to describe how each page looks, right down to fractions of a millimetre?
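
For a sense of scale on "fractions of a millimetre": PDF's default unit is the point, 1/72 of an inch, and coordinates can be fractional. A small worked conversion, as a sketch. The helper names are invented for illustration, and the A4 dimensions are the standard 210 by 297 mm.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def pt_to_mm(pt: float) -> float:
    """Convert PDF points to millimetres."""
    return pt * MM_PER_INCH / POINTS_PER_INCH


A4_MM = (210.0, 297.0)
A4_PT = tuple(round(mm_to_pt(side), 2) for side in A4_MM)

# Example:
# print(A4_PT)           # A4 page size in points
# print(pt_to_mm(1.0))   # one point is roughly a third of a millimetre
```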

Because they will still be readable a hundred years or more from now. Your favorite format, whatever it is, probably won't be.
Get a 10" tablet or a laptop. It's painful to read any format on a phone.
They look great on a foldable phone, which is about 8 inches diagonally with a squarish aspect ratio.
Web standards don't fix any of the problems.

I hate reading web pages on my phone. PDFs are far superior, especially on tablet.