I used the just-released Gemini Pro API with multimodal input to test some of the things from the infamous Gemini demo. You can see my GPT-4 recreation of that ad which went viral here [ https://www.youtube.com/watch?v=__nL7Vc0OCg ].
Gemini Pro is... not great. In one test, I asked what gesture I was making (while showing a thumbs up) -- it said thumbs down, and added: "The image is a commentary on the changing nature of truth".
I think the fair comparison would be GPT-3.5 (if it supported image inputs) vs Gemini Pro. It would be great to compare this with Gemini Ultra next year.