by bilsbie 972 days ago
Is this true even if you want to identify something in an image and get its pixel coordinates?

Like say a pickleball.

1 comment

Yes, you can. The model I was talking about, LLaVA, only outputs text, but other models such as SEEM (https://github.com/UX-Decoder/Segment-Everything-Everywhere-...) output a segmentation map. You could prompt the model with "Where is the pickleball in the image?" and get a segmentation map that you could then use to compute its center. Please let me know if you would be interested in having SEEM available in Datasaurus.
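
To make the "compute its center" step concrete, here is a minimal sketch in Python with NumPy. It assumes you have already converted the model's output into a 2D binary mask (nonzero where the pickleball is); the exact output format of SEEM is not assumed here, and `mask_center` is a hypothetical helper name, not part of any library.

```python
import numpy as np


def mask_center(mask):
    """Return the (x, y) centroid of the nonzero pixels in a 2D mask.

    Hypothetical helper: `mask` is assumed to be a 2D array where nonzero
    values mark the object. Returns None if nothing was segmented.
    """
    mask = np.asarray(mask).astype(bool)
    ys, xs = np.nonzero(mask)  # row indices (y) and column indices (x)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())


# Toy mask standing in for a real segmentation map:
# a 3x3 blob covering rows 2..4 and columns 3..5.
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:6] = True

center = mask_center(mask)
print(center)  # (4.0, 3.0), i.e. x=4, y=3 in pixel coordinates
```

The centroid is just the mean of the pixel coordinates inside the mask, so it works for any blob shape, though for a partially occluded ball it will be pulled toward the visible part.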