by j-pb
1546 days ago
None of those virtues hold in practice. I've worked at both public library digitization efforts and machine learning companies that did document ingestion and analytics. You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF.
Everything else is often wrong, broken, or non-existent. By separating the meaning from the visual representation there is no incentive to keep the invisible data workable. PDF might as well be replaced with SVG, in terms of rendering consistency and metadata extraction capabilities.
Because for a plain vector image format it's not that impressive.
>> Searchable, copy-able, transmittable, and data is extractable. They are also an open format, don't rely on a central service to be available, and they preserve presentation across platforms. They have metadata, and are annotatable and reviewable. And the PDF format is the best for long-term preservation, carefully designed to be readable in 50 years - partly because they preserve presentation across platforms - and that includes the metadata, annotations, and reviews.
> None of those virtues hold in practice.
> You always OCR the PDF visuals to get the text, because that's the only thing reliable about PDF. Everything else is often wrong, broken, or non-existent.
Which don't hold in practice? Are they not searchable? Is presentation not preserved? I use a lot of PDFs and they hold for me. PDFs are very popular, so they must work pretty well.
> SVG
Is there a standard way to do review and annotation, and is presentation preserved, for example when printing? Also, PDFs contain various image formats; do they contain SVG?