Hacker News
by Quarrelsome 2298 days ago
Any tricks for decimal points versus noise? It's a terrifying outcome, and all I've got is doing statistical analysis on the data you've already got and highlighting "outliers".
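For concreteness, here is a minimal sketch of that kind of outlier check, assuming the amounts for one field are already parsed into floats. The function name `flag_outliers` and the 3.5 cutoff are illustrative choices, not something from the thread. It uses the median absolute deviation rather than the mean, so a single value that is off by a factor of 100 does not drag the baseline with it.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indexes of values that sit far from the median (hypothetical helper)."""
    med = median(values)
    # Median absolute deviation: a spread estimate that one bad read cannot distort.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Every typical value is identical, so anything different is suspect.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

amounts = [12.50, 14.00, 13.25, 1275.0, 12.75, 13.00]
suspects = flag_outliers(amounts)
```

This only highlights candidates for review; it cannot tell a dropped decimal point from a genuinely large amount.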
4 comments

Change the decimal point in the font to something distinctive before rasterizing.
For something like bank statements, I'd use the rigidly defined formatting (both number formatting and field position) to decide how to interpret OCR misfires. My larger concern then would be a missed leading 1 (1500.00 vs. 500.00), but checking for dark pixels immediately preceding the number will flag those errors. And I suppose looking for dark pixels between digits could help with missed decimals too.
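A rough sketch of the dark-pixel check, assuming a grayscale image stored as a list of rows (0 = black, 255 = white) and a bounding box for the recognized number in `(left, top, right, bottom)` form. The name `ink_left_of` and the margin and threshold values are made up for illustration and would need tuning per scan resolution.

```python
def ink_left_of(image, box, margin=12, dark_below=128, min_dark=4):
    """Report whether there is unexplained ink just left of a recognized number."""
    left, top, right, bottom = box
    x0 = max(0, left - margin)  # do not run off the left edge of the image
    # Count dark pixels in the strip between x0 and the box's left edge.
    dark = sum(
        1
        for y in range(top, bottom)
        for x in range(x0, left)
        if image[y][x] < dark_below
    )
    return dark >= min_dark

# A blank 30x10 page, then the same page with a thin vertical stroke
# (a plausible missed "1") a few pixels left of the recognized number.
page = [[255] * 30 for _ in range(10)]
number_box = (15, 2, 25, 8)
clean = ink_left_of(page, number_box)
for y in range(2, 8):
    page[y][10] = 0
missed_digit = ink_left_of(page, number_box)
```

The same counting loop could be pointed at the strip between two digit boxes to look for a missed decimal point, with a lower `min_dark` since a period is only a few pixels.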
I've done this a bit. I define a range per numeric field, and if a value falls outside that range, I send it to another queue for manual review. Sometimes I'll write rules: if it's a dollar amount that usually ends in ".00" and I don't read a decimal but I do read a trailing "00", then I'll just fix that automatically when the value is outside my range.
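Something like the following is one way to express that rule. It is a sketch under assumptions: the function name `review_amount`, the status strings, and the extra requirement that the repaired value must land back inside the range are all illustrative, not taken from the comment above.

```python
from decimal import Decimal, InvalidOperation

def review_amount(text, low, high):
    """Return (value, status); status is 'ok', 'auto_fixed', or 'manual_review'."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None, "manual_review"  # not even parseable as a number
    if low <= value <= high:
        return value, "ok"
    # Out of range, no decimal point read, but a trailing "00" is present:
    # assume the point was dropped and reinsert it before the last two digits.
    if "." not in text and text.endswith("00") and len(text) > 2:
        fixed = Decimal(text[:-2] + "." + text[-2:])
        if low <= fixed <= high:
            return fixed, "auto_fixed"
    return value, "manual_review"

# A field expected to hold amounts between 10 and 5000 dollars.
result = review_amount("150000", Decimal("10"), Decimal("5000"))
```

Using `Decimal` instead of `float` keeps the cents exact, which matters if the repaired values feed into totals.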
(Novice speaking) Maybe there's something about looking for the spacing / kerning that is taken up by a decimal point? (Not sure if OCR tools have any way to look for this)
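A sketch of how the spacing idea might look, assuming the OCR step can report a horizontal extent `(left, right)` in pixels for each recognized character. Whether a given tool exposes per-character boxes is something to check; the name `wide_gaps` and the 1.8 ratio are invented for illustration.

```python
from statistics import median

def wide_gaps(char_spans, ratio=1.8):
    """Return indexes i where the gap after character i is unusually wide."""
    spans = sorted(char_spans)  # order characters left to right
    # Horizontal whitespace between each character and the next one.
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(spans, spans[1:])]
    if len(gaps) < 2:
        return []  # too few characters to say what a normal gap is
    typical = max(median(gaps), 1)  # avoid a zero baseline on tight fonts
    return [i for i, g in enumerate(gaps) if g > ratio * typical]

# "150000" read with no decimal point, but with extra room after the 4th digit.
spans = [(0, 8), (10, 18), (20, 28), (30, 38), (45, 53), (55, 63)]
candidates = wide_gaps(spans)
```

A wide gap two digits from the end is a decent hint that a decimal point was lost there, though proportional fonts will make the baseline noisier than in this fixed-width example.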