Hacker News
by hakunin 1 day ago
IME way better. It may not be the best out there, but it's cheap (2c per page), fast, has an easy-to-integrate API, and is sufficient for my needs. It does things like describe what's drawn in pictures and shown in graphs, which all helps when searching.

Great! Have you tested how well it returns coordinates of objects/text? I've tried generic LLMs like Gemini/Qwen/Gemma and they are all unstable with coordinates around text, though better when using visual grounding.
Yeah, for perfect positioning/overlaying I would be much stricter with my requirements. For that type of OCR I used Apple’s own Live Text framework that comes with macOS. But in this use case I only care about standalone plain text and descriptive text to store in the database, not overlaying it on the original content, so I never tested Mistral on that front.