by hyborg787 4694 days ago
This has nothing to do with OCR. It's an issue with the JBIG2 compression re-using similar patches as substitutes for certain areas of the images if they're "close enough". This issue is exacerbated at lower resolutions.

The process that's in use and what you consider to be true OCR are very similar, differing mainly in the last step (where OCR maps to a standard character set, but JBIG2 creates one on the fly along with a corresponding font). However, the errors at issue arise from the part of the process where JBIG2 and OCR are doing pretty much the same thing, so even if the analogy is flawed, it is still highly instructive. Saying this has nothing to do with OCR is quite simply wrong, since this is clearly very closely related to OCR even if it doesn't meet your exact (unspecified) definition of OCR.
That's OCR...
Well, it doesn't go all the way, at least in this implementation (contrary to Xerox's statement, we've been told compression is not standardized), to actually recognize the symbols it finds. If it did, it would presumably make far fewer of these errors, maybe almost none, since when it's uncertain it could just go with the original.
O-SubC-R then, perhaps. It is still recognizing shapes/symbols, which is the very basis of OCR.

This seems a bit hair splitty when the end result is the same as invalid OCR dictionaries.

Well sure, but then why don't we just call it "lossy GZIP"? OCR is a pretty specific subset, and produces characters - this does not produce computer-readable characters, therefore not OCR.
What are you on about? What does it produce if not computer-readable characters? Computer illegible characters? Are you saying it cannot read from the dictionary it creates? Or from the characters it is later optically recognizing off that dictionary?

Again from the JBIG2 wiki[1]:

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded.."

It seems not only is JBIG2 being deployed as OCR by Xerox for whatever reason, its implementation in this case is an absolute failure.

[1] http://en.wikipedia.org/wiki/JBIG2

Does it produce ASCII? UTF? If no, it's not OCR.

edit: by the definition you seem to be going on, any facial recognition is also OCR, since you could consider a face a 'glyph' (edit: 'symbol'). The only 'text' thing here that I can see is that it is intended to be used on text, which lends itself to some optimizations, not that it's actually text-based in any way.

It produces symbols, not characters.

Say that the scanner internally splits the scan into regions of 10x10 pixels that it saves in memory. If another region differs in fewer than (say) 10% of its pixels, it is assumed that the two regions are identical, and the first one is reused in place of the second. The regions have no semantic meaning.
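
To make that concrete, here is a minimal Python sketch of the scheme just described. It is a toy under stated assumptions, not the JBIG2 algorithm or Xerox's implementation: the 10x10 block size and 10% threshold come from the example above, while the function names (`split_blocks`, `differing_fraction`, `compress`, `decompress`) and the 0/1 list-of-lists image format are invented for illustration.

```python
# Hypothetical sketch of the block-substitution idea; not real JBIG2.
BLOCK = 10        # region size from the example (10x10 pixels)
THRESHOLD = 0.10  # regions differing in fewer than 10% of pixels are merged


def split_blocks(image, block=BLOCK):
    """Yield (row, col, tile) for each block-sized tile of a 2D list of 0/1 pixels."""
    for r in range(0, len(image), block):
        for c in range(0, len(image[0]), block):
            tile = tuple(tuple(row[c:c + block]) for row in image[r:r + block])
            yield r, c, tile


def same_shape(a, b):
    """True if two tiles have the same height and width."""
    return len(a) == len(b) and len(a[0]) == len(b[0])


def differing_fraction(a, b):
    """Fraction of pixel positions where two equally sized tiles disagree."""
    total = sum(len(row) for row in a)
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def compress(image, block=BLOCK, threshold=THRESHOLD):
    """Return (stored_tiles, placements); placements maps (row, col) to a tile index."""
    stored, placements = [], {}
    for r, c, tile in split_blocks(image, block):
        for i, known in enumerate(stored):
            if same_shape(tile, known) and differing_fraction(tile, known) < threshold:
                placements[(r, c)] = i  # lossy step: reuse the earlier tile
                break
        else:
            stored.append(tile)
            placements[(r, c)] = len(stored) - 1
    return stored, placements


def decompress(stored, placements, height, width):
    """Rebuild an image by painting each stored tile at its recorded positions."""
    image = [[0] * width for _ in range(height)]
    for (r, c), i in placements.items():
        for dr, row in enumerate(stored[i]):
            image[r + dr][c:c + len(row)] = row
    return image
```

With two blocks that differ in 5 of 100 pixels, `compress` stores only one tile and `decompress` paints it in both positions, so the second block silently comes back as a copy of the first. Nothing in the code knows or cares whether a tile contains a digit.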

OCR translates the scan into a character set.

OCR by definition outputs text, not binary data that resembles the bitmap of the input image.
OK down-voter. Read the JBIG2 wiki[1].

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded, typically also using context-dependent arithmetic coding, and the regions are encoded by describing which symbols appear where."

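As a rough illustration of the quoted description (group foreground pixels into symbols, build a dictionary, then record which symbols appear where), here is a small Python sketch. It is an assumption-laden toy, not the JBIG2 codec: it treats each 4-connected blob as a "symbol", only merges exact duplicates (so it is lossless), and skips the arithmetic coding entirely. The names `connected_components`, `encode` and `decode` are hypothetical.

```python
# Hypothetical sketch of symbol-dictionary coding; not the real JBIG2 codec.

def connected_components(image):
    """Group foreground (1) pixels into 4-connected components, each a set of (row, col)."""
    height, width = len(image), len(image[0])
    seen, components = set(), []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or (r, c) in seen:
                continue
            stack, pixels = [(r, c)], set()
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                pixels.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            components.append(pixels)
    return components


def encode(image):
    """Return (dictionary, placements): unique shapes plus (symbol_id, top, left) entries."""
    dictionary, placements = [], []
    for pixels in connected_components(image):
        top = min(y for y, _ in pixels)
        left = min(x for _, x in pixels)
        # Normalize the blob so identical shapes compare equal wherever they sit.
        shape = frozenset((y - top, x - left) for y, x in pixels)
        if shape not in dictionary:
            dictionary.append(shape)
        placements.append((dictionary.index(shape), top, left))
    return dictionary, placements


def decode(dictionary, placements, height, width):
    """Rebuild the image by stamping each dictionary symbol at its recorded position."""
    image = [[0] * width for _ in range(height)]
    for symbol_id, top, left in placements:
        for dy, dx in dictionary[symbol_id]:
            image[top + dy][left + dx] = 1
    return image
```

A lossy variant would replace the exact `shape not in dictionary` test with a "close enough" comparison, which is where substitutions of the kind discussed in this thread could creep in. Note that the output is symbol ids and positions; no step assigns a character code to any shape.
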
Then from the OCR wiki[2].

"Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching" or "pattern recognition"."

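For comparison, matrix matching as quoted can be sketched in a few lines of Python. This is a toy with made-up 3x3 templates (`GLYPHS`) and hypothetical function names, not how any particular OCR engine works; the point is only that the comparison is pixel by pixel against stored glyphs and the result is a character.

```python
# Hypothetical sketch of matrix matching; templates and names are invented.
GLYPHS = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def mismatches(a, b):
    """Count pixel positions where two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(bitmap, glyphs=GLYPHS):
    """Return the character whose stored glyph differs from bitmap in the fewest pixels."""
    return min(glyphs, key=lambda ch: mismatches(bitmap, glyphs[ch]))
```

The pixel comparison is the same kind of operation a pattern-matching compressor performs. The difference is the final step: `recognize` returns a key from a fixed, pre-existing character set rather than an index into a dictionary built from the scan itself.
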
Furrow your brow and smash the down-vote arrow all you wish. It won't stop JBIG2 from doing much of what people consider OCR to be doing today: recognizing characters. JBIG2 just adds making its own dictionary, which opened the path to this topic today.

[1] http://en.wikipedia.org/wiki/JBIG2
[2] http://en.wikipedia.org/wiki/Optical_character_recognition