Hacker News
by sbierwagen 938 days ago
>spatial component is hard to describe in text

IDEFICS, MiniGPT-4, NExT-GPT and LLaVA are open-source multimodal LLMs that can read images.

1 comment

Yes, but do they only get a vague idea of what's onscreen, or can they really see what's going on in each tile, keep track of all the stats, and use those to inform their decisions?