Hacker News new | ask | show | jobs
by bob1029 507 days ago
> they failed miserably at finding bounding boxes, usually just making up random coordinates.

This makes sense to me. These LLMs likely have no statistics about the spatial relationships of tokens in a 2D raster space.

1 comments

The spatial awareness is what grounding models try to achieve, e.g. UGround [1]

[1] https://huggingface.co/osunlp/UGround-V1-7B?language=python