| HN Mirror


by j-pb 1546 days ago

I'm not talking about scanned data. I'm talking about digitally born PDFs, which have to be OCRed nevertheless because their text layer is unusable.

LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with both columns interleaved, or not at all. Verbatim is mangled, formulas are a hot mess. The text order between paragraphs is not preserved.

LaTeX is an angstrom-accurate typesetting system and it's great for that, but it's abysmal at producing digital formats.

Could it produce better PDF documents? Sure. Does it do so, and do package authors care about any other layer except the printed visual one? No.

All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly. Long-term archival is super tricky and considered an unsolved problem by libraries.

There's a reason arXiv stores the LaTeX as the canonical representation and not the PDF. The source code is simply a better archival format.

1 comment

wolverine876 1546 days ago

> LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with both columns interleaved, or not at all. Verbatim is mangled, formulas are a hot mess. The text order between paragraphs is not preserved.

That's interesting. I must not deal with many LaTeX-based PDFs. The text in electronically born PDFs I use is usually nearly flawless, with the exception of the bizarre extra space inserted between some words and the challenge of hyphenated words on lines that no longer wrap in that spot.
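For what it's worth, the two glitches mentioned above (stray extra spaces and words hyphenated at old line breaks) can usually be patched up after copying the text out. A minimal sketch in Python, assuming the text layer has already been extracted to a plain string; the function name `clean_text_layer` is made up for illustration, and the approach is a heuristic that will not fix interleaved columns or broken reading order:

```python
import re


def clean_text_layer(raw: str) -> str:
    """Tidy text copied out of a PDF text layer (heuristic and lossy)."""
    # Rejoin words split by a hyphen at a line break: "type-\nsetting" -> "typesetting".
    # Caveat: this also merges genuine hyphenated compounds that happened to break there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn remaining single line breaks into spaces; blank lines stay as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The type-\nsetting  system is\naccurate.\n\nNext  paragraph."
    print(clean_text_layer(sample))
```

The hyphen rule is the risky part: without a dictionary there is no way to tell a soft line-break hyphen from a real one, so treat the result as good enough for search or copy-paste, not as a faithful reconstruction.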

> All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly.

I don't have your expertise, but I've heard a different story from librarians regarding PDF and particularly PDF/A.