Thank you, this makes sense! As [1] pithily puts it:
>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.
That seems to imply it's not necessarily something unique to images, just a byproduct of having a better conversion from "raw input -> embeddings" [2]. That said, there is a certain elegance in handling both images and text with the same method.