HN Mirror


by j-pb 1546 days ago

None of those virtues hold in practice. I've worked both on public library digitization efforts and at machine learning companies that did document ingestion and analytics. You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF. Everything else is often wrong, broken, or non-existent.

By separating the meaning from the visual representation there is no incentive to keep the invisible data workable.

PDF might as well be replaced with SVG in terms of rendering consistency and metadata extraction capabilities, because for a plain vector image format it's not that impressive.


wolverine876 1546 days ago

If I understand correctly, your comment addresses PDFs created from scanning paper. PDFs at arXiv are converted from LaTeX inputs, per the OP, and not via scanning and OCR; therefore they contain perfect renditions of the text.

>> Searchable, copy-able, transmittable, and data is extractable. They are also an open format, don't rely on a central service to be available, and they preserve presentation across platforms. They have metadata, and are annotatable and reviewable. And the PDF format is the best for long-term preservation, carefully designed to be readable in 50 years - partly because they preserve presentation across platforms - and that includes the metadata, annotations, and reviews.

> None of those virtues hold in practice.

> You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF. Everything else is often wrong, broken, or non-existent.

Which don't hold in practice? Are they not searchable? Is presentation not preserved? I use a lot of PDFs and they hold for me. PDFs are very popular, so they must work pretty well.

> SVG

Is there a standard way to do review and annotation, and is presentation preserved, for example when printing? Also, PDFs contain various image formats; do they contain SVG?


j-pb 1546 days ago

I'm not talking about scanned data. I'm talking about digitally born PDFs, which have to be OCRed nevertheless because their text layer is unusable.

LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with both columns interleaved, or not at all. Verbatim is mangled, formulas are a hot mess. The text order between paragraphs is not preserved.
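To illustrate the interleaving: a minimal sketch, assuming a text layer that exposes line fragments with page coordinates. The `Fragment` type, the sample page, and the "exactly two columns split at mid-page" rule are all hypothetical and only for illustration; real documents need far more robust layout analysis.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical line fragment from a PDF text layer."""
    x: float  # left edge of the line on the page
    y: float  # distance from the top of the page
    text: str


def naive_order(fragments):
    """Top-to-bottom, then left-to-right: this interleaves the columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, page_width):
    """Read the left column fully, then the right one.

    Assumes exactly two columns split at mid-page (an illustrative
    simplification, not a general layout algorithm).
    """
    mid = page_width / 2
    left = sorted((f for f in fragments if f.x < mid), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= mid), key=lambda f: f.y)
    return [f.text for f in left + right]


# Made-up two-column page with two lines per column.
PAGE = [
    Fragment(50, 100, "Left line 1"),
    Fragment(320, 100, "Right line 1"),
    Fragment(50, 115, "Left line 2"),
    Fragment(320, 115, "Right line 2"),
]

print(naive_order(PAGE))
print(column_order(PAGE, page_width=600))
```

The naive ordering alternates between the two columns line by line, which is the kind of mangled reading order described above.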

LaTeX is an angstrom-accurate typesetting system and it's great for that, but it's abysmal at producing digital formats.

Could it produce better PDF documents? Sure. Does it do so, and do package authors care about any other layer except the printed visual one? No.

All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly. Long-term archival is super tricky and considered an unsolved problem by libraries.

There's a reason arXiv stores the LaTeX as the canonical representation and not the PDF. The source code is simply a better archival format.


wolverine876 1546 days ago

> LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with both columns interleaved, or not at all. Verbatim is mangled, formulas are a hot mess. The text order between paragraphs is not preserved.

That's interesting. I must not deal with many LaTeX-based PDFs. The text in electronically-born PDFs I use is usually nearly flawless, with the exceptions of the bizarre extra space inserted between some words, and the challenge of hyphenated words on lines that no longer wrap in that spot.
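For what it's worth, the two glitches I mention can be roughly patched after extraction. A minimal sketch, assuming the text arrives as a list of lines; `rejoin_lines` is a hypothetical helper, and the naive rule will also wrongly merge genuinely hyphenated compounds that happen to break at a line end:

```python
import re


def rejoin_lines(lines):
    """Join extracted lines into one string.

    Collapses runs of spaces and glues words split by an end-of-line
    hyphen. Naive: a real hyphenated compound broken at the line end
    loses its hyphen too.
    """
    text = ""
    for line in lines:
        line = re.sub(r" {2,}", " ", line.strip())
        if text.endswith("-"):
            text = text[:-1] + line  # drop the hyphen, glue the halves
        elif text:
            text += " " + line
        else:
            text = line
    return text


print(rejoin_lines(["The text  layer is often un-", "usable after extrac-", "tion."]))
```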

> All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly.

I don't have your expertise, but I've heard a different story from librarians regarding PDF and particularly PDF/A.
