Hacker News
by lifthrasiir 816 days ago
This JBIG2 "myth" is too widespread. It is true that Xerox's algorithm mangled some numbers in its JBIG2 output, but that is not an inherent flaw of JBIG2 to begin with, and Xerox's encoder misbehaved almost exclusively at lower dpis; 300dpi or more was barely affected. Other artifacts at lower resolution can exhibit similar mangling as well (specifics would of course vary), and this or a similar incident hasn't been repeated so far. So I don't feel it is even a worthy concern at this point.

1. No one, at least not OP, ever said it's an inherent flaw of JBIG2. The fact that it's an implementation error on Xerox's end is a good technical detail to know, but it is irrelevant to the topic.

2. "Lower DPI" is extremely common if your definition of it is anything under 300dpi. At my company, all text documents are scanned at 200dpi by default. And 150dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.

> Other artifacts at lower resolution can exhibit similar mangling as well (specifics would of course vary)

The majority of traditional compression schemes would make text unreadable when the compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. clearly showing a wrong number instead of just a blurry blob that could be either).
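To make the difference concrete, here is a minimal sketch of symbol matching and substitution, the general technique lossy JBIG2 encoders are built on. It is not Xerox's code: the 5x3 glyphs, the `max_diff` threshold and all function names are made up for illustration.

```python
# Hypothetical 5x3 glyph bitmaps; '#' is ink. Real scanned glyphs are much larger.
SIX = ("###",
       "#..",
       "###",
       "#.#",
       "###")
EIGHT = ("###",
         "#.#",
         "###",
         "#.#",
         "###")

def hamming(a, b):
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def encode(glyphs, max_diff):
    """Build a symbol dictionary; each glyph reuses the first entry within max_diff pixels."""
    dictionary, indices = [], []
    for g in glyphs:
        for i, sym in enumerate(dictionary):
            if hamming(g, sym) <= max_diff:
                indices.append(i)  # substitute the stored symbol for this glyph
                break
        else:
            dictionary.append(g)   # no close match: store the glyph itself
            indices.append(len(dictionary) - 1)
    return dictionary, indices

def decode(dictionary, indices):
    """Rebuild the page from dictionary references."""
    return [dictionary[i] for i in indices]
```

At this toy resolution the "6" and the "8" differ by a single pixel, so `encode([SIX, EIGHT], max_diff=1)` decodes to two crisp copies of the "6", while `max_diff=0` keeps both digits intact. The output is never blurry; it is either right or cleanly wrong.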

The "specifics" here are exactly what the whole topic is focused on, so you can't really gloss over them.

> 1. No one, at least not OP, ever said it's an inherent flaw of JBIG2. The fact that it's an implementation error on Xerox's end is a good technical detail to know, but it is irrelevant to the topic.

It is relevant only when you assume that lossy compression has no way to control or even know of such critical changes. In reality most lossy compression algorithms use rate-distortion optimization, which is only possible when you have some measure of "distortion" in the first place. Given that the error rarely occurred at higher dpis, its cause should have been either a miscalculation of distortion or a misconfiguration of distortion thresholds for patching.

In any case, a correct implementation should be able to do the correct thing. It would have been much more problematic if similar cases had been repeated, since that would mean it is much harder to write a correct implementation than expected, but that didn't happen.
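As a rough sketch of what I mean by rate-distortion optimization with a threshold: the encoder picks whichever candidate minimizes D + lam * R, but only among candidates whose distortion stays under a hard limit. The candidate names, bit counts and pixel errors below are invented for illustration and are not taken from any real encoder.

```python
def choose(candidates, lam, max_distortion):
    """Return the name of the candidate minimizing D + lam * R with D <= max_distortion.

    candidates: list of (name, rate_bits, distortion) tuples, all hypothetical.
    The exact candidate (distortion 0) keeps the allowed set non-empty.
    """
    allowed = [c for c in candidates if c[2] <= max_distortion]
    return min(allowed, key=lambda c: c[2] + lam * c[1])[0]

candidates = [
    ("new_symbol", 120, 0),    # store the glyph itself: costly but exact
    ("reuse_symbol", 8, 3),    # point at a similar glyph: cheap, 3 pixels off
]
```

With `lam=0.1` and a loose limit of 10 pixels, reuse wins (cost 3.8 against 12.0). Tighten the limit to 2 pixels, or lower `lam` to 0.01, and the encoder stores the exact glyph instead. A miscalculated D or a badly chosen limit is all it takes to end up on the wrong side of that decision.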

> The majority of traditional compression schemes would make text unreadable when the compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. clearly showing a wrong number instead of just a blurry blob that could be either).

Traditional compression schemes simply didn't have enough computational power to do so. The "blurry blob" is, by definition, something with only lower-frequency components, and there are only a small number of those, so they were easier to preserve even with limited resources. But if you have and recognize a similar enough pattern, it should be exploited for further compression. Motion compensation in video codecs was already doing a similar thing, and either filtering or intelligent quantization that preserves higher-frequency components would be able to do so too.
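For the "blurry blob" part, a minimal sketch of what dropping higher-frequency components does to a sharp edge, using a hand-written 1-D orthonormal DCT in the spirit of transform coders. The 8-sample edge and the `keep` parameter are assumptions for illustration; real codecs work on 2-D blocks and quantize coefficients rather than zeroing them outright.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1 / n)
        s += sum(c[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def low_pass(x, keep):
    """Keep only the first `keep` (lowest-frequency) coefficients."""
    c = dct(x)
    return idct([v if k < keep else 0.0 for k, v in enumerate(c)])

edge = [0, 0, 0, 0, 255, 255, 255, 255]   # a crisp black-to-white edge
blurred = low_pass(edge, keep=2)          # smooth ramp instead of a step
```

Keeping two of the eight coefficients turns the 255-level step into a gentle ramp: the edge is smeared, but nothing is replaced by a different, equally crisp shape.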

----

> 2. "Lower DPI" is extremely common if your definition of it is anything under 300dpi. At my company, all text documents are scanned at 200dpi by default. And 150dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.

I admit I have generalized too much, but the choice of scan resolution is highly specific to content, font sizes and even writing systems. If you and your company can cope with lower DPIs, that's good for you, but I believe 300dpi is indeed the safe minimum.