
I followed this guide for fine-tuning: https://ai.google.dev/gemini-api/docs/model-tuning

Arabic OCR is a mess with historical texts. Take the word الف (alf, "thousand"), which shows up in spelled-out dates like 1950. In old documents, the ف (fa) had a dot below it, but modern OCR doesn't get this and outputs الد (alad), which is just gibberish in Arabic.

Same problem with ق (qaf), which is written as ف (fa) in old Arabic.
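To make those two confusions concrete, here's a rough sketch of how you could expand an OCR'd word into the readings it might really stand for, so that something context-aware can pick between them later. The `CONFUSIONS` table and the `candidate_readings` name are made up for illustration and only cover the two cases mentioned here; a real table would come from looking at actual OCR errors.

```python
from itertools import product

# Hypothetical confusion table built only from the two examples above:
# OCR output character -> characters it may stand for in the original.
CONFUSIONS = {
    "د": ["د", "ف"],  # old fa (dot below) misread as dal
    "ف": ["ف", "ق"],  # old qaf that looks like a modern fa
}

def candidate_readings(word: str) -> list[str]:
    """Return every reading of `word` allowed by the confusion table."""
    # Characters not in the table map only to themselves.
    options = [CONFUSIONS.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]

if __name__ == "__main__":
    for reading in candidate_readings("الد"):
        print(reading)
```

For the gibberish word الد this yields two candidates, the original string and الف. The number of candidates grows quickly with word length, which is one reason to let context decide instead of enumerating everything.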

And don't get me started on merged letters! In محمد (Muhammad), sometimes the م (meem) sits right on top of the ح (haa), or appears as a little circle below the line. Modern OCR has no clue what to do with these.

My solution? Run OCR first, then use LLMs to fix the mess based on context. The surprising part? In my tinkering, smaller fine-tuned models actually do BETTER at this specific task than the big general-purpose ones. They seem to learn the patterns of historical Arabic quirks more effectively. Pretty neat tradeoff of specialized knowledge vs. general intelligence.
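The shape of that OCR-then-LLM step looks roughly like the sketch below. It is not my actual code: `build_prompt`, `correct_page`, and the prompt wording are placeholders, and the model is passed in as a plain callable so no particular API is assumed. You would wrap whatever fine-tuned model you have in a function that takes a prompt string and returns text.

```python
from typing import Callable

def build_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in a correction instruction (wording is illustrative)."""
    return (
        "The following Arabic text came from OCR of a historical document "
        "and may contain misread letters. Return only the corrected text.\n\n"
        + ocr_text
    )

def correct_page(ocr_text: str, llm: Callable[[str], str]) -> str:
    """Send OCR output through a model and return the corrected text.

    `llm` is any function mapping a prompt to a completion; it stands in
    for a call to a fine-tuned model.
    """
    corrected = llm(build_prompt(ocr_text)).strip()
    # Keep the raw OCR text if the model returns nothing usable.
    return corrected or ocr_text

if __name__ == "__main__":
    # Stand-in "model" that fixes only the one example from this comment.
    def fake_llm(prompt: str) -> str:
        text = prompt.split("\n\n", 1)[1]
        return text.replace("الد", "الف")

    print(correct_page("الد", fake_llm))
```

Keeping the model behind a callable also makes it easy to run the same pages through a small fine-tuned model and a big general one and compare the outputs.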