by netdur 462 days ago
I have documents from the last 50 years that I need to digitize, millions of them written in old Arabic. OCR is inaccurate because the documents are handwritten, so I need to fine-tune a model on around 300k text pairs (OCR output and manually corrected versions).

This sounds very interesting; can you share more? Thanks!
I followed this guide for fine-tuning: https://ai.google.dev/gemini-api/docs/model-tuning
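
A minimal sketch of how the (OCR output, corrected text) pairs could be put into shape before tuning. The field names `text_input` and `output`, the 5% validation split, and the helper names are assumptions for illustration; check the linked guide for the exact format it expects.

```python
import json
import random


def build_tuning_records(pairs, val_fraction=0.05, seed=0):
    """Turn (ocr_text, corrected_text) pairs into train/validation records.

    Drops empty and duplicate pairs. The record keys are an assumed format,
    not something confirmed by the thread.
    """
    seen = set()
    records = []
    for ocr_text, corrected in pairs:
        ocr_text, corrected = ocr_text.strip(), corrected.strip()
        if not ocr_text or not corrected:
            continue  # skip pairs with a missing side
        if (ocr_text, corrected) in seen:
            continue  # skip exact duplicates
        seen.add((ocr_text, corrected))
        records.append({"text_input": ocr_text, "output": corrected})

    # Deterministic shuffle so the split is reproducible.
    random.Random(seed).shuffle(records)
    n_val = int(len(records) * val_fraction)
    return records[n_val:], records[:n_val]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping Arabic text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping a held-out validation slice makes it possible to compare the tuned model against raw OCR on pages it has never seen.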

Arabic OCR is a mess with historical texts. Take the word الف (alf, "thousand") in dates like 1950: in old documents the ف (fa) has its dot below the letter, but modern OCR doesn't recognize this and outputs الد (alad), which is just gibberish in Arabic.

Same problem with ق (qaf), which in old Arabic is written the way ف (fa) is written today.

And don't get me started on merged letters! In محمد (Muhammad), sometimes the م (meem) sits right on top of the ح (haa), or appears as a little circle below the line. Modern OCR has no clue what to do with these

My solution? Run OCR first, then use LLMs to fix the mess based on context. The surprising part? In my tinkering, smaller fine-tuned models actually do BETTER at this specific task than the big general-purpose ones. They seem to learn the patterns of historical Arabic quirks more effectively. Pretty neat tradeoff of specialized knowledge vs. general intelligence
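
One way to check a claim like "the small tuned model does better" is to score each corrector by character error rate (CER) against the manually corrected text. The sketch below is only an illustration: `toy_corrector` is a hypothetical stand-in for a model call, and the sample sentence is made up, not taken from the real data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


def mean_cer(correct, samples):
    """Average CER of a corrector over (ocr_text, reference) samples."""
    return sum(cer(correct(ocr), ref) for ocr, ref in samples) / len(samples)


def identity(text):
    """Baseline: raw OCR output with no correction."""
    return text


def toy_corrector(text):
    """Hypothetical stand-in for an LLM call: fixes one known whole-word confusion."""
    return " ".join("الف" if word == "الد" else word for word in text.split())


# Made-up sample: OCR read الف as الد.
samples = [("سنة الد وتسعمائة", "سنة الف وتسعمائة")]
print("raw OCR CER:  ", mean_cer(identity, samples))
print("corrected CER:", mean_cer(toy_corrector, samples))
```

Swapping `toy_corrector` for a function that calls each candidate model gives a like-for-like comparison on the same held-out pairs.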