
Yes, and if you look at the "blue cube on a red cube beside a yellow sphere" example, it's clear that there are other areas where it simply lacks the semantic basis to get a request right when the request needs to be correct in a non-image sense. It knows letters, and that letters come in sequences related to things it might paint, but it has no very good dictionary mapping those sequences to things; it knows how to draw a cube and a sphere, but the semantics of "on" and "beside" are largely absent.

I don't think that is terribly surprising, nor a very cogent detraction from the model.