They do have a vision encoder like many other LLMs, so in theory the model should be able to write the positions textually, then call a render command, then look at the rendered bitmap. It's all very opaque though; I'd love a visualisation of the latent-space data that it's converting the image to. I found that very long vertical images throw Opus off completely, for example. It's very interesting to experiment with this: let it play with placement and let it call a render command, then let it describe in detail what it sees. I'll be looking into this a lot this year. Maybe there will be niche models that are smaller but have better vision capabilities than Opus. A world where one model rules would be incredibly depressing (kinda like what we saw with some software companies since the 90s).