For local inference, sure, but we simply lack the computing power to train them on all the images and HTML content available on the internet and in books. That will happen sometime in the future, though.
Ah right, sorry, you were making a much more interesting point than my reply! I read "UI development" and jumped to the conclusion that the point was just about inference-time modify-test cycles. Yes, agreed, if they trained on images, or even better (?) on (code, image) or (code-delta, image-delta) pairs, they would surely be better at UI development.