by TerraHertz 4521 days ago

Some other problems with the prize task as defined:

* Wikipedia is light on images (due to copyright issues, I suppose), but the wider challenge of compressing 'human knowledge' really has to deal with the vast masses of paper books, which are fundamentally images (before they are OCR'd and compressed). Also, since this is really an exercise in preserving the historical record, you want to preserve the exact original visual appearance, blemishes and all, which means the true image must be retained along with the OCR text. Additionally, there are other recording media (photos, film, etc.) to capture.

* When you include analog original formats (of which ink on paper is one) then the whole idea of 'lossless' compression is moot. What you're really after is compression with no _perceptible_ content loss.

* Which means that the issue of 'acceptable quality' is crucial. And deciding what is acceptable quality loss for different forms of source material is something that will require very good AI.

For instance, images that are printed with a mix of offset-printed screening (those tiny dot patterns) and solid ink edges (text, hard lines, etc.) are very difficult to compress. You can't just blur everything to reduce the screening dots to an even gradient, because that ruins the edge definition of the text and lines. You can't scan at a resolution lower than or similar to the dot pitch, since that produces horrible moiré patterning. You can't scan and save the image at a high enough resolution to capture the dots exactly, since that makes the file size HUGE.

So your AI has to actually 'understand' the image to some extent, and smooth out just the screened areas, with precise but hard-to-identify mask edges. If you've ever done this by hand in Photoshop, you'll know how hard it is.

That is actually a problem I'm stuck on with some historic technical manuals I'm trying to make digital copies of atm. If anyone knows of an automated way to blur offset screening (but not other printed elements on the page) to evenly shaded tones, please say so. See: http://everist.org/archives/scans/query/image_processing_que...
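To make the masking idea concrete, here is a crude sketch, not a solution. It assumes a grayscale page held as a list of rows of 0-255 values, and it assumes the blur radius `r` roughly matches the screen dot pitch. The names `box_blur`, `descreen` and `edge_threshold` are made up for illustration. The heuristic: blur the whole page, then look at the gradient of the blurred copy. Screening dots average out to a flat tone there, while a solid ink boundary still shows a big step, so original pixels are kept only near those steps.

```python
def box_blur(img, r):
    """Mean filter with a (2r+1) x (2r+1) window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def descreen(img, r=2, edge_threshold=40.0):
    """Blur screened areas, but keep original pixels near strong edges.

    r and edge_threshold are guesses that would need tuning per scan.
    """
    h, w = len(img), len(img[0])
    smooth = box_blur(img, r)
    out = [row[:] for row in smooth]
    for y in range(h):
        for x in range(w):
            # Gradient of the *blurred* image: halftone dots average out,
            # while a solid ink boundary still shows a large step.
            gx = smooth[y][min(w - 1, x + r)] - smooth[y][max(0, x - r)]
            gy = smooth[min(h - 1, y + r)][x] - smooth[max(0, y - r)][x]
            if max(abs(gx), abs(gy)) > edge_threshold:
                out[y][x] = img[y][x]  # edge pixel: leave it sharp
    return out
```

This will fail in exactly the hard cases described above: thin strokes printed over a screened background, or screened areas that themselves contain a sharp tonal boundary. It only shows the shape of the "mask, then blur selectively" approach.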

Another question: are there any freeware compression utils that can do the RARbook trick? I.e., pass them a file.jpg, and they'll ignore the file extension and just scan through the file for a valid archive header, then unpack from there on, as WinRAR does. So frustrating that WinZip, ALZip, etc. don't do this.
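Failing a ready-made util, the scanning part is easy to sketch. The following assumes the archive was simply appended to the image, and it looks only for the well-known RAR and ZIP magic bytes. The function names are made up for illustration. It carves everything from the first signature onward into a new file, which a normal unpacker can then open.

```python
# Magic bytes that mark the start of an archive.
SIGNATURES = {
    b"Rar!\x1a\x07\x00": "rar",       # RAR 1.5 to 4.x
    b"Rar!\x1a\x07\x01\x00": "rar5",  # RAR 5.0 and later
    b"PK\x03\x04": "zip",             # ZIP local file header
}


def find_embedded_archive(data):
    """Return (offset, kind) of the earliest archive signature, or None."""
    hits = []
    for sig, kind in SIGNATURES.items():
        offset = data.find(sig)
        if offset >= 0:
            hits.append((offset, kind))
    return min(hits) if hits else None


def carve_archive(src_path, dst_path):
    """Copy everything from the first signature onward into dst_path.

    Returns the detected kind, or None if no signature was found.
    """
    with open(src_path, "rb") as f:
        data = f.read()
    hit = find_embedded_archive(data)
    if hit is None:
        return None
    offset, kind = hit
    with open(dst_path, "wb") as out:
        out.write(data[offset:])
    return kind
```

Usage would be something like `carve_archive("file.jpg", "file.rar")`. A short signature such as `PK\x03\x04` can occur by chance inside compressed image data, so a real tool would want to validate the header that follows before trusting the hit.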