|
|
|
|
|
by AnthonyR
32 days ago
|
|
Have you tried this task using an actual OCR model like Google Cloud Vision AI? I am not sure if this is what Gemini uses under the hood but multi-modal LLMs are not designed to extract text like this so it should be no surprise it's not good at it? |
|
You can find the explanation and the comparison in the article, which we benchmarked pure CNN models, pure LLM models and a hybrid architecture like ours.