| HN Mirror

by wolverine876 1549 days ago

> the PDF is not a format fit for sharing, discussing, and reading on the web. PDFs are (mostly) static, 2-dimensional and non-actionable objects. It is not a stretch to say that a PDF is merely a digital photograph of a piece of paper.

It is too far a stretch, murdering the poor subject:

PDFs are the best format available for long-term information, such as research papers. They have the advantages of digital data: Searchable, copy-able, transmittable, and data is extractable. They are also an open format, don't rely on a central service to be available, and they preserve presentation across platforms. They have metadata, and are annotatable and reviewable. And the PDF format is the best for long-term preservation, carefully designed to be readable in 50 years - partly because they preserve presentation across platforms - and that includes the metadata, annotations, and reviews.

PDFs are like paper in that they will look the same 50 years from now as they do today, unlike (almost?) any other digital format.

Yes, I wish they were a bit more dynamic in layout, and that the text was more cleanly extracted.

3 comments

rhn_mk1 1548 days ago

> data is extractable

That's true for plain text (in the best case), but try extracting an equation, table or a diagram.

Stepping away from the best case, PDFs in theory look the same everywhere, but turn into a mess on buggy implementations or differing rendering engines – due to the insistence on having a stable presentation, they assume positioning and sizing always works, so when that fails, it fails worse than a buggy rendering of a presentation-agnostic document like an HTML page.

(In my experience, bugs either enter just before printing, or when displaying using JS-based renderers).

wolverine876 1547 days ago

> That's true for plain text (in the best case), but try extracting an equation, table or a diagram.

Good point. From what format are tables, diagrams, and formulas extractable (while retaining format)? I've had good luck moving tables between my web browser and email applications, though it always surprises me that the HTML is implemented similarly enough.

> PDFs in theory look the same everywhere, but turn into a mess on buggy implementations or differing rendering engines

I don't deal with PDFs programmatically, and it sounds like you might, but from the user end, and from running networks of thousands of users, I've hardly ever seen problems in practice except for the browsers' JavaScript renderers.

cxr 1548 days ago

Of all the properties you mention, that PDFs "preserve presentation across platforms" is the only one that isn't shared with responsibly wielded HTML (e.g. the sort of thing that Zotero produces when stashing a local copy—which uses SingleFile under the hood). It's also the one property that is net undesirable—being more liability than benefit.

Being sent a PDF of an academic paper to read (or do anything with other than send it to a printer) is about ten times lower on the user preference scale than having someone send a link to a blog post on the same subject. (The other reason for that being that when people are in the mode that involves writing an academic paper, they forget how to write anything that anyone would actually want to read. Most academic writing sucks.)

Of the properties you listed that PDF does share with self-contained HTML, on the other hand, there isn't one that PDF isn't worse at—not even "transmittable". (Initially I would have put them on the same level there, but of course that's wrong. When you're in an environment where for whatever reason a file copy is not an option, PDF's binary format makes it harder to transmit the bytestream than HTML.)

Who cares if a PDF looks the same everywhere if that means everyone who encounters it bounces away rather than having to slog through any attempt to actually read it?

wolverine876 1547 days ago

> responsibly wielded HTML

That would be fantastic, but there are no available solutions that meet the specs I listed, including long-term preservation and annotation (what annotation subsystems are there for HTML?). ePub is 'responsibly wielded HTML', but it lacks annotation and long-term preservation is iffy.

I much prefer PDFs to blog posts, personally - they are mine, I can annotate them, etc. Also, I find much more thought is put into a PDF than a blog post (which both beat Twitter!).

cxr 1547 days ago

> there are no available solutions that meet the specs I listed

We're going in circles. "Preserve presentation across platforms" is an anti-feature. No one has created a solution that satisfies that constraint because it's (a) a lot of work for (b) something that is the opposite of what the people involved are actually aiming for.

If you're preparing material for print and it's important to be able to represent the exact printed layout (e.g. to print again), then PDF makes sense. If printing doesn't appear in the pipeline twice or even once, then PDF is very, very bad.

> I much prefer PDFs to blog posts, personally - they are mine, I can annotate them, etc.

You can do that with blog posts.

> I find much more thought is put into a PDF than a blog post

I don't. I find, as I alluded to before, that there's much less thought put into trying to express things clearly and economically. Instead, that concern is replaced with a concern for writing in a way that sounds "academic" but is painful to read.

PS:

> ePub is 'responsibly wielded HTML'

Not at all. EPUB is very irresponsibly designed. "The format works in my browser today" should have been the #1 sanity check on that workgroup's output. They failed.

wolverine876 1546 days ago

> "Preserve presentation across platforms" is an anti-feature. No one has created a solution that satisfies that constraint because it's (a) a lot of work for (b) something that is the opposite of what the people involved are actually aiming for.

I think you are missing the experiences of a large part of the user population. They put together their report, or book, or brochure or datasheet or whatever, and they want it to look a certain way, regardless of the platform, and they almost all use PDF. People care very much about how their work product looks. PDF is a solution that satisfies that constraint - I have seen it do that very consistently for a long time.

How do you annotate blog posts, and in a way that is preserved for decades.

I almost suspect we are somehow talking about different things, because even the most non-technical users know that about PDFs. But also it reads like you are finding a way to disagree with everything.

cxr 1546 days ago

> I think you are missing the experiences of a large part of the user population.[...] People care very much about how their work product looks.

I'm aware such people exist, I just don't confuse that fact with a belief that a majority of readers don't have problems with the experience of e.g. trying to read a PDF on a phone or anything else that isn't at least A4-/letter-sized (heck, PDFs have fewer people read them than would otherwise even when the person doing the reading is using a desktop or laptop)—precisely because PDFs preserve the presentation for print.

> How do you annotate blog posts, and in a way that is preserved for decades.

Is that a statement or a question? In any case, it's hard to begin to conceptualize what sort of misunderstandings about the relevant media could lead to either. Blog posts (published originally as HTML, that is) are not inherently less susceptible to being saved than an academic article published as PDF. I did, however, already mention Zotero (and SingleFile).

> I almost suspect we are somehow talking about different things, because even the most non-technical users know that about PDFs. But also it reads like you are finding a way to disagree with everything.

Am I? I'm pretty sure that I understand what you're saying, at least, and that we're talking about the same things. I don't know what you expect, though, when the fundamental premises are in dispute. There's no way to just "yes-and" through disagreements like that.

wolverine876 1545 days ago

> I'm pretty sure that I understand what you're saying, at least, and that we're talking about the same things. I don't know what you expect, though, when the fundamental premises are in dispute. There's no way to just "yes-and" through disagreements like that.

If you start from the premise that you know everything, there's not much to talk about. The basis of such interactions is to try to learn from the other person, be intellectually curious.

> Is that a statement or a question? In any case, it's hard to begin to conceptualize what sort of misunderstandings about the relevant media could lead to either.

That is the language of someone trying to fight - about document formats!

j-pb 1548 days ago

None of those virtues hold in practice. I've worked both at public library digitization efforts and machine learning companies that did document ingestion and analytics. You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF. Everything else is often wrong, broken, or non-existent.

By separating the meaning from the visual representation there is no incentive to keep the invisible data workable.

PDF might as well be replaced with SVG, in terms of rendering consistency and metadata extraction capabilities. Because for a plain vector image format it's not that impressive.

wolverine876 1548 days ago

If I understand correctly, your comment addresses PDFs created from scanning paper. PDFs at arXiv are converted from LaTeX inputs, per the OP, and not via scanning and OCR; therefore they contain perfect renditions of the text.

>> Searchable, copy-able, transmittable, and data is extractable. They are also an open format, don't rely on a central service to be available, and they preserve presentation across platforms. They have metadata, and are annotatable and reviewable. And the PDF format is the best for long-term preservation, carefully designed to be readable in 50 years - partly because they preserve presentation across platforms - and that includes the metadata, annotations, and reviews.

> None of those virtues hold in practice.

> You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF. Everything else is often wrong, broken, or non-existent.

Which don't hold in practice? Are they not searchable? Is presentation not preserved? I use a lot of PDFs and they hold for me. PDFs are very popular, so they must work pretty well.

> SVG

Is there a standard way to do review and annotation, and is presentation preserved, for example when printing? Also, PDFs contain various image formats; do they contain SVG?

j-pb 1548 days ago

I'm not talking about scanned data. I'm talking about digitally born PDFs, which have to be OCRed nevertheless because their text layer is unusable.

LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with both columns interleaving, or not at all. Verbatim is mangled, formulas are a hot mess. The text order between paragraphs is not preserved.

LaTeX is an angstrom-accurate typesetting system and it's great for that, but it's abysmal at producing digital formats.

Could it produce better PDF documents? Sure. Does it do so, and do package authors care about any other layer except the printed visual one? No.

All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly. Long term archival is super tricky and considered an unsolved problem by libraries.

There's a reason arXiv stores the LaTeX as the canonical representation and not the PDF. The source code is simply a better archival format.

wolverine876 1547 days ago

> LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with both columns interleaving, or not at all. Verbatim is mangled, formulas are a hot mess. The text order between paragraphs is not preserved.

That's interesting. I must not deal with many LaTeX-based PDFs. The text in electronically-born PDFs I use is usually nearly flawless, with the exceptions of the bizarre extra space inserted between some words, and the challenge of hyphenated words on lines that no longer wrap in that spot.

> All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly.

I don't have your expertise, but I've heard a different story from librarians regarding PDF and particularly PDF/A.
