Hacker News new | ask | show | jobs
by giampaolo44 1737 days ago
I agree. Have been experimenting with headers/bold recognition so far, and it's promising.
1 comments

I also did, out of curiosity, the day I posted that: the values of the average pixel-by-pixel difference in the two cropped bitmaps of a word in the original document and a new rendering, normal vs italics, do not differ much.

Better techniques are required than the raw one I used: for example, finding the best overlapping of the two bitmaps, maybe with some sort of gradient descent over a few pixels distance in panning and scaling - this should give a near to 100% correspondence in the correct case (regular vs italics vs bold vs monospaced vs BI, BM, IM, BIM), but only if the font used for the comparison is the same.

By the way: in the fact that while adjusting the overlapping the computed difference should increase with the gradient when the style corresponds (R on R), but may be random in other cases (R on I, R on M, though not R on B), there could be another key in the heuristic.

Cool stuff. I did not push myself that far since I was mostly looking for headers/titles recognition. In that area while the height of the line is on average (meaning everything else being equal) a less-valuable-than-expected indication, word width is more accurate. You still need to rely on some statistics, but results are more descriptive and reliable to make inferences.