| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by aidenn0 680 days ago

Huh, I tried with the version from pip (instead of my package manager) and it completes in 22s. Output on the only page I tested is considerably worse than tesseract, particularly with punctuation. The paragraph detection seemed to not work at all, rendering the entire thing on a single line.

Even worse for my uses, Tesseract had two mistakes on this page (part of why I picked it), and neither of them were correctly read by EasyOCR.

Partial list of mistakes:

1. Missed several full-stops at the end of sentences

2. Rendered two full-stops as colons

3. Rendered two commas as semicolons

4. Misrendered every single em-dash in various ways (e.g. "\_~")

5. Missed 4 double-quotes

6. Missed 3 apostrophes, including rendering "I'll" as "Il"

7. All 5 exclamation points were rendered as a lowercase-ell ("l"). Tesseract got 4 correct and missed one.