by Dylan16807 4698 days ago
The only thing that's missing is a mapping from 'symbol #28' into 'ascii #63'. Internally it's storing instances of symbols plus font data for those symbols.

Also, something to think about: an EBCDIC document accidentally printed as ASCII/8859-1 would carry just as little semantic meaning when fed into an OCR program. But I don't think anyone would argue that reading it wasn't OCR.

1 comment

That "only thing that's missing" is a very very big thing, and difficult to get correct. And where does it say it's storing font data for the symbols?
A font doesn't need to be anything more than a series of bitmaps. And then each character location on the image, ignoring errors, references one of these bitmaps. That's how documents with embedded bitmap fonts generally work.

That mapping isn't a very big thing. Sometimes text-based PDFs don't even have it, and you don't notice unless you try to copy text out and get the wrong letters.
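
To make that concrete, here is a rough Python sketch of the layout described above: a table of bitmaps, a list of placements that reference them by index, and a separate symbol-to-character table. Every name here (`BITMAPS`, `Placement`, `SYMBOL_TO_CHAR`, `render`, `extract_text`) and the tiny 3x3 glyphs are invented for illustration; this is not any real file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    symbol: int  # index into the bitmap table
    x: int       # left edge on the page
    y: int       # top edge on the page


# Hypothetical "font": nothing more than a list of bitmaps.
BITMAPS = [
    ("#.#", "###", "#.#"),  # symbol 0, drawn to look like "H"
    ("###", ".#.", "###"),  # symbol 1, drawn to look like "I"
]

# The page only says which bitmap goes where.
PAGE = [Placement(0, 0, 0), Placement(1, 4, 0), Placement(0, 8, 0)]


def render(page, bitmaps, width, height):
    """Draw the page. No character codes are needed for this step."""
    rows = [["."] * width for _ in range(height)]
    for p in page:
        for dy, row in enumerate(bitmaps[p.symbol]):
            for dx, bit in enumerate(row):
                if bit == "#":
                    rows[p.y + dy][p.x + dx] = "#"
    return ["".join(r) for r in rows]


def extract_text(page, symbol_to_char):
    """Copy text out. This is the only step that needs the mapping."""
    ordered = sorted(page, key=lambda p: (p.y, p.x))
    # Symbols with no mapping become U+FFFD (replacement character).
    return "".join(symbol_to_char.get(p.symbol, "\ufffd") for p in ordered)


# The "missing piece": an assumed symbol-index-to-character table.
SYMBOL_TO_CHAR = {0: "H", 1: "I"}

if __name__ == "__main__":
    print("\n".join(render(PAGE, BITMAPS, 11, 3)))
    print(extract_text(PAGE, SYMBOL_TO_CHAR))
    print(extract_text(PAGE, {}))
```

The page renders the same with or without `SYMBOL_TO_CHAR`; only `extract_text` changes, which is why a missing or wrong table goes unnoticed until someone copies text out.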