Hi HN, I’ve been working on an OCR pipeline optimized for machine learning dataset preparation. It’s designed to process complex academic materials (math formulas, tables, figures, and multilingual text) and output clean, structured formats such as JSON and Markdown.
Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.).
Would love to hear any feedback or ideas for improvement. GitHub: https://github.com/ses4255/Versatile-OCR-Program
It's that Xerox bug on steroids, where digits on scanned pages would get silently replaced by other digits...
I'd want to see some proper hallucination analysis.
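One rough way to start on that kind of analysis, assuming you have hand-verified ground-truth transcriptions for a sample of pages: diff each OCR output against its reference and flag every replaced span that involves a digit. This is only a sketch, not anything taken from the linked repo; `digit_substitutions` is a hypothetical helper name, and a character-level diff will not catch every kind of hallucination (for example, fluent but invented sentences).

```python
from difflib import SequenceMatcher


def digit_substitutions(reference: str, ocr_output: str) -> list[tuple[str, str]]:
    """Return (expected, got) pairs for replaced spans that contain a digit.

    `reference` is assumed to be a hand-verified transcription of the page;
    `ocr_output` is whatever the pipeline produced for the same page.
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long strings, so every character is compared.
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    swaps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only substitutions; insertions and deletions are skipped
        expected, got = reference[i1:i2], ocr_output[j1:j2]
        if any(ch.isdigit() for ch in expected + got):
            swaps.append((expected, got))
    return swaps


if __name__ == "__main__":
    # Toy example: one digit silently changed during OCR.
    print(digit_substitutions("Total: 1686 units", "Total: 1636 units"))
```

Counting these pairs per page, and comparing the rate across the different OCR stages, would at least show whether the LLM stage introduces more silent digit swaps than the conventional ones.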