|
If I understand correctly, your comment addresses PDFs created from scanning paper. PDFs at arXiv are converted from LaTeX inputs, per the OP, and not via scanning and OCR; therefore they contain perfect renditions of the text. >> Searchable, copy-able, transmittable, and data is extractable. They are also an open format, don't rely on a central service to be available, and they preserve presentation across platforms. They have metadata, and are annotatable and reviewable. And the PDF format is the best for long-term preservation, carefully designed to be readable in 50 years - partly because they preserve presentation across platforms - and that includes the metadata, annotations, and reviews. > None of those virtues hold in practice. > You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF. Everything else is often wrong, broken, or non-existent. Which don't hold in practice? Are they not searchable? Is presentation not preserved? I use a lot of PDFs and they hold for me. PDFs are very popular, so they must work pretty well. > SVG Is there a standard way to do review and annotation, and is presentation preserved, for example when printing? Also, PDFs contain various image formats; do they contain SVG? |
LaTeX is one of the worst offenders when it comes to producing mangled text layers. Multi-column text is often stored with the two columns interleaved, or not stored at all. Verbatim text is mangled, and formulas are a hot mess. The text order between paragraphs is not preserved.
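To make the interleaving concrete, here is a toy sketch in Python. It does not parse a real PDF: the fragment coordinates, the 300-unit column boundary, and the function names are all made up for illustration. It only shows how reading positioned text fragments top to bottom without knowing about columns yields interleaved text.

```python
# Toy model of a two-column page: each fragment is (x, y, text),
# with y growing downward. All values here are hypothetical.
fragments = [
    (50, 100, "Left column, line 1."),
    (350, 100, "Right column, line 1."),
    (50, 120, "Left column, line 2."),
    (350, 120, "Right column, line 2."),
]


def naive_order(frags):
    """Read top to bottom, then left to right, ignoring columns."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return [text for _x, _y, text in ordered]


def column_order(frags, boundary=300):
    """Read the whole left column first, then the right column.

    `boundary` is an assumed x position separating the two columns.
    """
    ordered = sorted(frags, key=lambda f: (f[0] >= boundary, f[1], f[0]))
    return [text for _x, _y, text in ordered]


if __name__ == "__main__":
    print("naive: ", " ".join(naive_order(fragments)))
    print("column:", " ".join(column_order(fragments)))
```

The naive order alternates between the left and right columns line by line, which is roughly what an interleaved text layer looks like when you copy and paste from it.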
LaTeX is an angstrom-accurate typesetting system, and it's great for that, but it's abysmal at producing digital formats.
Could it produce better PDF documents? Sure. Does it do so, and do package authors care about any other layer except the printed visual one? No.
All those extra features you mention make your "still readable in 50 years" requirement go out the window pretty quickly. Long-term archival is super tricky, and libraries consider it an unsolved problem.
There's a reason arXiv stores the LaTeX as the canonical representation and not the PDF. The source code is simply a better archival format.