by throwaway314155 206 days ago
There aren’t any YOLO models for captioning and the other models aren’t robust enough to make for good embedding models.

You can get labels out of the classifier and bounding box models.

They are super fast.
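For what it's worth, here is a rough sketch of what "getting labels out" could look like once a detector has run. It doesn't call any real library: the `(class_id, confidence, box)` tuple format, the `class_names` mapping, and the `labels_from_detections` helper are all made up for illustration, so adapt them to whatever your model actually returns.

```python
from collections import Counter


def labels_from_detections(detections, class_names, min_conf=0.5):
    """Collapse raw detections into per-image label counts.

    `detections` is assumed to be an iterable of
    (class_id, confidence, box) tuples; this is a hypothetical format,
    not the output type of any particular library.
    """
    counts = Counter()
    for class_id, confidence, _box in detections:
        # Skip low-confidence detections so noisy boxes don't become tags.
        if confidence >= min_conf:
            counts[class_names[class_id]] += 1
    return counts


# Made-up detector output for a single image.
class_names = {0: "person", 16: "dog"}
detections = [
    (0, 0.91, (10, 20, 110, 220)),
    (16, 0.78, (150, 90, 260, 210)),
    (0, 0.32, (300, 40, 340, 120)),  # dropped: below min_conf
]

tags = labels_from_detections(detections, class_names)
print(sorted(tags.items()))
```

The resulting counts can be stored as plain tags next to each image, which is much cheaper than running a captioning model, though you only get the classes the detector was trained on.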

It's just an alternative I'm mentioning. I'm assuming the person knows a little bit about that domain.

Otherwise, the first option would be CLIP, I assume. LLM-VL models are just super slow and compute-intensive.