by ekidd 480 days ago
I've been experimenting with vlm-run (plus custom form definitions), and it works surprisingly well with Gemini 2.0 Flash. Costs, as I understand it, are also quite low for Gemini. You'll get the best results with simple to medium-complexity forms, roughly the same ones you could ask a human to process with less than 10 minutes of training.

If you need something like this, it's definitely good enough that you should consider kicking the tires.
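
For a sense of what a simple "custom form definition" could look like, here is a rough sketch in plain Python. This is not vlm-run's API: `IntakeForm`, its fields, and `parse_form` are made up for illustration, and the JSON string stands in for whatever the model returns.

```python
import json
from dataclasses import dataclass, fields


# Hypothetical form definition; the field names are illustrative only.
@dataclass
class IntakeForm:
    full_name: str
    date_of_birth: str  # kept as a plain string, e.g. "1990-04-12"
    has_insurance: bool


def parse_form(raw_json: str) -> IntakeForm:
    """Parse a model's JSON reply and check it against the form definition."""
    data = json.loads(raw_json)
    kwargs = {}
    for f in fields(IntakeForm):
        # Reject replies that drop a field the form requires.
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        # Reject replies whose value has the wrong basic type.
        if not isinstance(data[f.name], f.type):
            raise ValueError(f"wrong type for field: {f.name}")
        kwargs[f.name] = data[f.name]
    return IntakeForm(**kwargs)


# Stand-in for a model response to a scanned intake form.
reply = '{"full_name": "Ada Lovelace", "date_of_birth": "1815-12-10", "has_insurance": false}'
form = parse_form(reply)
```

A form this small is about the "10 minutes of training" level: a few flat fields with obvious types, and a strict check so a missing or mistyped field fails loudly instead of slipping through.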

2 comments

BTW, check out the Gemini qualitative results in our hub: https://github.com/vlm-run/vlmrun-hub?tab=readme-ov-file#-qu....

It gives you an idea of where today's models fail (Gemini Flash, OpenAI GPT-4o and GPT-4o mini, and open-source ones like Llama 3.2 Vision, Qwen2.5-VL, etc.).

Very cool! If you have more examples / schemas you'd be interested in sharing, feel free to add to the `contrib` section.