Hacker News
by jabron 204 days ago
What do you mean by "bounding boxes"? They were talking about captions and embeddings, so a vision-language model is required.

I suggested YOLO and other non-LLM-VL models as a much faster alternative.

Of course, CLIP would be the other option, instead of a big LLM-VL model.
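
To make the CLIP point concrete, here is a rough sketch of how CLIP-style matching picks a caption: the image and each candidate text are embedded into the same vector space, and the text with the highest cosine similarity wins. The vectors below are made-up toy values, not real CLIP outputs, and `cosine` and `best_caption` are hypothetical helper names; with a real model the vectors would come from its image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(caption_vecs, key=lambda c: cosine(image_vec, caption_vecs[c]))

# Toy placeholder embeddings (assumption: stand-ins for encoder outputs).
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}

print(best_caption(image_vec, caption_vecs))
```

No text is generated at any point, only one forward pass per image plus a few dot products, which is why this route is so much cheaper than a big LLM-VL model.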