by jcuenod 466 days ago

Just tested with a multilingual (bidi) English/Hebrew document.

The Hebrew output bore no correspondence to the source text whatsoever. The document included an English translation alongside the Hebrew, and the Hebrew the model produced was a back-translation of that English.

Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side-project I'm working on (fixpdfs.com).

Also, I thought OCR implied that you could get bounding boxes for the text (and reconstruct a text layer on a scan, for example). Am I wrong, or is the term just overloaded now?

nicodjimenez 466 days ago

You can get bounding boxes from our PDF API at Mathpix.com

Disclaimer: I’m the founder.

kergonath 466 days ago

Mathpix is ace. Those are the best results I’ve gotten so far for scientific papers and reports. It understands the layout of complex documents very well; it’s quite impressive. Equations are perfect, and figure extraction works well.

There are a few annoying issues, but overall I am very happy with it.

nicodjimenez 466 days ago

Thanks for the kind words. What are some of the annoying issues?

kergonath 464 days ago

I had a billing issue at the beginning. It was resolved very nicely, but I try to be careful, and I monitor the bill a bit more than I would like.

Actually, my main remaining technical issue is conversion to standard Markdown for use in a data processing pipeline that has issues with the Mathpix dialect. Ideally I’d do it on a computer that is air-gapped for security reasons, but I haven’t found a very good way of doing it because the Python library wanted to check my API key.
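For illustration only, here is a minimal offline sketch of that kind of conversion. It assumes the incompatibility is limited to math delimiters, with the input using `\( ... \)` for inline math and `\[ ... \]` for display math, and the pipeline expecting `$...$` and `$$...$$`. The function name `to_dollar_math` is hypothetical, and this is not the Mathpix library's API. It uses only the Python standard library, so it needs neither network access nor an API key.

```python
import re

# Assumption: inline math is written \( ... \) and display math \[ ... \].
# Other dialect features (tables, figure environments, etc.) are not handled.
INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)
DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)


def to_dollar_math(text: str) -> str:
    """Rewrite \\( \\) and \\[ \\] math delimiters as $ and $$ delimiters."""
    # Replacement callables are used so that backslashes inside the math
    # (e.g. \frac) are copied through literally rather than being
    # interpreted as regex escape sequences.
    text = DISPLAY.sub(lambda m: "$$" + m.group(1).strip() + "$$", text)
    text = INLINE.sub(lambda m: "$" + m.group(1).strip() + "$", text)
    return text


if __name__ == "__main__":
    sample = r"Energy \( E = mc^2 \) and \[ a^2 + b^2 = c^2 \]"
    print(to_dollar_math(sample))
```

A regex pass like this does not skip code blocks and can misfire on LaTeX line breaks with an optional argument such as `\\[2pt]`, so it is a starting point rather than a full converter.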

A problem I have, which is not really Mathpix’s fault, is that I don’t really know how to store the figure images so that they stay with the text in a convenient way. I haven’t found a very satisfying strategy.
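One possible approach, sketched here under assumptions rather than offered as a recommendation: keep each document as a folder containing the Markdown file plus a `figures/` subfolder, and rewrite the image links to relative paths. The sketch assumes figures appear as standard Markdown image references of the form `![alt](url)`. The function name `localize_images`, the `figures` folder name, and the `.png` fallback extension are all hypothetical choices. It only rewrites links and returns the mapping; fetching the files is left to a separate step.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches Markdown image references: ![alt text](url)
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def localize_images(markdown: str, figures_dir: str = "figures"):
    """Rewrite image URLs to relative paths and return (new_text, mapping).

    mapping goes from each original URL to the local path it should be
    saved under, so the files can be downloaded or copied separately.
    """
    mapping = {}

    def repl(match):
        alt, url = match.group(1), match.group(2)
        # Keep the original extension if there is one; ".png" is an assumed default.
        suffix = PurePosixPath(urlparse(url).path).suffix or ".png"
        # A hash of the URL gives a stable, collision-resistant file name.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12] + suffix
        local = f"{figures_dir}/{name}"
        mapping[url] = local
        return f"![{alt}]({local})"

    return IMAGE.sub(repl, markdown), mapping


if __name__ == "__main__":
    text = "See ![Fig 1](https://example.com/a/fig1.jpg) here."
    new_text, mapping = localize_images(text)
    print(new_text)
    print(mapping)
```

Because the links are relative, the folder can be zipped or moved as a unit without breaking the references.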

Anyway, keep up the good work!
