Great! Have you tested how well it returns coordinates of objects/text? I've tried generic LLMs like Gemini/Qwen/Gemma, and they are all unstable with coordinates around text, though better when using visual grounding.
Yeah, for perfect positioning/overlaying I would be much stricter with my requirements. For that type of OCR I used Apple's own Live Text framework that comes with macOS. But in this use case I only care about standalone plain text and descriptive text to store in the database, not overlaying it on the original content, so I never tested Mistral on that front.