by lingjiekong
471 days ago
Has anyone found more details about the architecture of "mistral-ocr-latest"? Two thoughts: 1. I initially assumed it was a VLM parsing model, until I saw that it can extract images. Now I assume it is a pipeline of an image extraction step and a VLM, with their outputs combined into the final result. 2. If so, benchmarking the pipeline's results against an end-to-end VLM such as Gemini 2.0 Flash might not be an apples-to-apples comparison.