If by "lossy" you mean that "17" may look like a crappier "17" with reasonable confidence, but will never, ever, become "21" at any compression setting, then I don't mind lossy. That's not asking for too much, is it?
But the scanner isn't starting off with "17" (as in two ASCII characters); it is starting off with a bitmapped image that your brain happens to interpret as the number 17. It is too much to ask that a lossy image compression algorithm never produce a compressed bitmap that your brain interprets differently from the original.
Having various compression/quality options allows you to pick the tradeoff (file size/resulting quality) that is acceptable for your situation. There is no perfect setting for all situations. Even the original bitmap is an imperfect (i.e. lossy) rendering of the original document.
It seems a bit too coincidental that images to which human beings assign semantic value are being transformed into images to which human beings assign different semantic value.
I don't expect the scanner to have any semantic awareness of the document content, so when I hear "lossy compression", my expectation is "image may become illegible", and not "image may remain legible, but become inaccurate".
This is Hacker News -- I don't expect everyone to know how JBIG2 or other compression schemes work. But before you insinuate that the scanner has semantic awareness of the document and is altering that meaning in a less-than-coincidental way, I would hope that you could have a cursory look at how such compression works.
The issue only involves small letters, because the compression scheme breaks up the image into patches and then tries to identify visually similar blocks and reuse them. Certain settings can allow for small blocks of text to be deemed identical, within a threshold, and thus replaced. That's all. Coincidence, not semantic awareness.
Hence the advisory notice to use a higher resolution -- smaller block sizes.
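To make the "identify visually similar blocks and reuse them" step concrete, here is a toy sketch. It is not the actual JBIG2 encoder: the 3x5 glyph bitmaps, the function names, and the simple pixel-mismatch threshold are all made up for illustration, and a real encoder segments and matches symbols in a more sophisticated way.

```python
# Hypothetical 3x5 bitmaps standing in for scanned glyph patches.
SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")
ONE = ("..#", "..#", "..#", "..#", "..#")


def mismatches(a, b):
    """Count the pixels that differ between two equally sized patches."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(patches, max_mismatches):
    """Store each patch once; reuse an earlier patch if it is 'close enough'."""
    dictionary, indices = [], []
    for patch in patches:
        for i, representative in enumerate(dictionary):
            if mismatches(patch, representative) <= max_mismatches:
                indices.append(i)  # reuse the stored patch
                break
        else:
            dictionary.append(patch)  # nothing similar yet, store a new one
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the page by pasting the stored patches back in."""
    return [dictionary[i] for i in indices]


page = [SIX, EIGHT, ONE, EIGHT]
strict = decode(*encode(page, max_mismatches=0))  # exact matches only
loose = decode(*encode(page, max_mismatches=1))   # one differing pixel allowed
```

With the threshold at 0 the page round-trips exactly. With a threshold of 1, the toy "8" differs from the toy "6" by a single pixel, so both 8s come back as perfectly crisp 6s. The fewer pixels a glyph covers, the more likely two different characters fall within the threshold.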
> The issue only involves small letters, because the compression scheme breaks up the image into patches and then tries to identify visually similar blocks and reuse them. Certain settings can allow for small blocks of text to be deemed identical, within a threshold, and thus replaced. That's all. Coincidence, not semantic awareness.
Copiers very commonly copy printed material. This sort of algorithm makes it likely that sometimes one character will be replaced by another, so it is a bad algorithm for the job.
I'm aware - I'm merely responding to the previous commenter's point about how the compression algorithm is "starting off with a bit mapped image that your brain happens to interpret as the number 17", and pointing out that if this were the case, the likely outcome should be a fuzzier-looking "17" and not a "21".
Clearly, the compression algorithm is designed around human perception (i.e. looking for visually-similar segments to, I assume, tokenize), and therefore does relate to the actual semantics of the document, albeit in a coarse and mechanical way. It did know enough to replace character glyphs with other character glyphs, but didn't know enough to choose the right ones.
My point is that it's not coincidental at all - this algorithm is obviously in a sort of "uncanny valley" in its attempt to model human visual perception.
It's not a coincidence that the thing that looks most like a blurred number is another blurred number.
A document will be covered in numbers, and the compression algorithm looks for similar blocks it can re-use; the side effect is sometimes it says "that blurry 4 looks pretty close to this blurry 2, so I'll just store that block once and reuse it".
The problem is that this is a minor side effect to a programmer and an absolutely massive issue to an end user that no one had thought of previously, and now we all have to be worried that all our scanned documents might be incorrect. (Just because this was found in Fuji Xerox scanners doesn't mean other brands don't also have the issue.)
Let's take that premise to OCR then. This whole debacle started with JBIG2 settings that I guess duplicated(?) one section and inserted it where similar text exists. Only it was marginally similar.
According to Adam (https://news.ycombinator.com/item?id=6156418) this is a known problem that Xerox, who call themselves document people for crying out loud, should have known about and compensated for.
I don't believe that e.g. JPEG compression can ever take a clearly readable 17 and turn it into a clearly readable 21. Lossy compression implies changing the data, obviously, but it does not have to imply changing the data in that particular way.
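A rough illustration of why the failure mode differs, under heavy simplification: in JPEG-style transform coding each block is rebuilt only from its own quantized coefficients, so a coarser setting degrades that block rather than pasting in pixels from somewhere else on the page. The sketch below is a 1-D toy with a single made-up quantization step, not real JPEG (which uses a 2-D 8x8 DCT, quantization tables and entropy coding), and it illustrates the claim rather than proving it.

```python
import math

N = 8  # samples per block


def scale(k):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt((1 if k == 0 else 2) / N)


def dct(x):
    """Forward DCT-II of one block of N samples."""
    return [
        scale(k) * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct()."""
    return [
        sum(scale(k) * coeffs[k] * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
        for n in range(N)
    ]


def lossy_roundtrip(x, step):
    """Quantize the block's own coefficients with one step size, then decode."""
    quantized = [round(c / step) for c in dct(x)]
    return idct([q * step for q in quantized])


row = [0, 0, 255, 255, 0, 0, 255, 255]   # hypothetical row of pixel values
fine = lossy_roundtrip(row, step=1)      # small step: close to the original
coarse = lossy_roundtrip(row, step=2000) # absurdly large step: detail is wiped out
```

With the small step every sample comes back within a couple of grey levels of the original. With the huge step every coefficient rounds to zero and the block decodes to a flat row: illegible, but not a different, sharp pattern borrowed from another block.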
Sure, but by default, people who don't know about these settings are being set up for failure when a few numbers get transposed. The manufacturer should ship the device with the highest quality settings so customers are impressed out of the box, instead of being disappointed that they have to tune the device to get a high quality copy. If you must save a few bytes in this day and age, then you can turn on these crazy settings yourself.
I can just see a legal loophole now for anyone using these devices, for example "the electronic document was modified by a Xerox and we don't have the original, those numbers were not what we signed, contract void".
Regardless of whether this is an optional setting or how small the font is, it can have major consequences for people who trust the device to produce, in all cases, an accurate representation of what they put into it.
Of course! Lossy compression is tolerable for cat videos on YouTube where it doesn't matter if a few details are wrong. It is absolutely not tolerable for storage of important documents. This is something that should go without saying.
Looks like the company is trying to weasel out of it and there are going to have to be lawsuits. Though I didn't really expect otherwise; if the dice come up badly, the damage from this could exceed the net value of the company.