Hacker News
by hyiltiz 2062 days ago
It is a pity that DjVu[0] wasn't even mentioned; it is an open format that was superior to PDF in many ways[1], including better optimization and more efficient storage.

[0] http://djvu.org/ [1] https://en.wikipedia.org/wiki/DjVu

4 comments

DjVu is a great format for scanned images, which is its primary use-case, but I'm not seeing where you can have actual, selectable text in a DjVu document, like you can with PDF and PostScript. It seems like it's all images.
> 3.3.2 Hidden text

> Every DjVu image optionally includes a hidden text layer that associates graphical features with the corresponding text. The hidden text layer is usually generated by running Optical Character Recognition software. This textual information provides for indexing DjVu documents and copying/pasting text from DjVu page images.

I copied that text from the DjVu spec, which is in the DjVu format.
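
For anyone who wants to check the hidden text layer on their own files, here is a minimal sketch. It assumes DjVuLibre's `djvutxt` command-line tool is installed and on the PATH, and that it accepts a `--page=` option as described in its man page. The function names and the `spec.djvu` filename are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess


def djvutxt_command(path, page=None):
    """Build the argument list for DjVuLibre's djvutxt tool (assumed installed)."""
    cmd = ["djvutxt"]
    if page is not None:
        # Assumption: djvutxt accepts --page=<pagespec> to restrict output to some pages.
        cmd.append(f"--page={page}")
    cmd.append(path)
    return cmd


def extract_hidden_text(path, page=None):
    """Return the hidden text layer as a string, or None if djvutxt is not available."""
    if shutil.which("djvutxt") is None:
        return None
    result = subprocess.run(
        djvutxt_command(path, page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "spec.djvu" is a placeholder filename, not a file referenced in this thread.
    print(djvutxt_command("spec.djvu", page=3))
```

If a file has no text layer (for example, a scan that was never run through OCR), the extracted text should simply come back empty.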

I have not read the specification, but the DjVu format must have a way to store plain text alongside the images, and that mechanism is frequently used.

I do not remember ever reading a DjVu file that did not allow searching and selecting the text, whereas PDF files that allow neither, because they store only the scanned images, are quite common.

Man, I haven't seen a DjVu file in years. It used to be somewhat common in scans of magazines and other media that relied on images. Pity that it didn't catch on, although I suppose it still could if some of its benefits were refined. I find that larger PDFs tend to tank performance. Is that a problem for DjVu files at all?
I don't see how DjVu solves vector graphics, which is a pretty important use case for PDF.
It's crazy that Yann LeCun was involved in the creation.
Yes, Yann and another machine learning celebrity: Léon Bottou!