by marvinkennis 1145 days ago
Seeing a lot of text-to-image out there recently. Does anyone know what the current state of the art is for image-to-text? I'm thinking of something similar to Midjourney's /describe command, which they added in v5.
2 comments

While it's not publicly available yet, I have strong suspicions that multimodal GPT-4 may actually be SOTA in image-to-text. The examples shown in the Sparks of AGI paper were extremely impressive imo, though of course those are cherry-picked so it's unclear how well the model will perform on non-cherry-picked images.
This is text + image -> text but pretty cool and still might be of interest to you:

https://llava-vl.github.io

Just entering "Describe this image" in the chat prompt got me exactly what I was looking for. Thanks!