This makes sense to me. These LLMs likely have no statistics about the spatial relationships of tokens in a 2D raster space.
[1] https://huggingface.co/osunlp/UGround-V1-7B?language=python