by zerojames 972 days ago
Vision models are deployed in different places than language models.

A vision model may be deployed on cameras without an internet connection, with data retrieved later; on camera streams in a factory; or on sports broadcasts where low latency is essential. In many cases, real-time -- or close to real-time -- performance is needed.

Fine-tuned models can deliver the requisite performance for vision tasks with relatively low computational power compared to the LLM equivalent. Vision weights are small relative to LLM weights.

LLMs are often deployed via API. This is practical for some vision applications (e.g. bulk processing), but for many use cases, not being able to run on the edge is a dealbreaker.

Foundation models certainly have a place.

CLIP, for example, is fast and can be used for tasks like classification on videos. Where I see opportunity right now is in using foundation models to train fine-tuned models: the foundation model acts as an automatic labeling tool that produces your dataset, which you then use to train a smaller model. (Disclosure: I co-maintain a Python package that lets you do this, Autodistill -- https://github.com/autodistill/autodistill).
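
To make the labeling idea concrete, here is a minimal sketch in plain Python. It is not Autodistill's API: `Prediction`, `auto_label`, `min_confidence`, and `fake_foundation_model` are all hypothetical names, and the fake model is a stand-in for whatever foundation model you would actually call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Prediction:
    label: str
    confidence: float


def auto_label(
    image_paths: Iterable[str],
    foundation_model: Callable[[str], Prediction],
    min_confidence: float = 0.5,
) -> List[Tuple[str, str]]:
    """Label each image with the foundation model, keeping only confident predictions."""
    dataset = []
    for path in image_paths:
        prediction = foundation_model(path)
        # Low-confidence labels are dropped so they do not pollute the training set.
        if prediction.confidence >= min_confidence:
            dataset.append((path, prediction.label))
    return dataset


def fake_foundation_model(path: str) -> Prediction:
    """Hypothetical stand-in: a real setup would run a large model on the image."""
    if "forklift" in path:
        return Prediction("forklift", 0.9)
    return Prediction("unknown", 0.2)


labeled = auto_label(["forklift_01.jpg", "blurry_02.jpg"], fake_foundation_model)
```

The resulting `(path, label)` pairs are what you would feed into training for the smaller, edge-deployable model. The confidence cutoff is an assumption on my part; in practice you would tune it or review a sample by hand.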

SAM (segmentation), CLIP (embeddings, classification), and Grounding DINO (zero-shot object detection), in particular, have a myriad of use cases, one of which is automated labeling.
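
For the embeddings-plus-classification case, the usual recipe is to compare an image embedding against one text embedding per candidate label and pick the closest. A rough sketch of just that comparison step, with made-up three-dimensional vectors standing in for real embeddings (the vectors and the `zero_shot_classify` helper are illustrative assumptions, not output from CLIP):

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(
    image_embedding: List[float], label_embeddings: Dict[str, List[float]]
) -> str:
    """Return the label whose embedding is most similar to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(image_embedding, label_embeddings[label]),
    )


# Toy vectors only; a real model would produce these from text prompts and a frame.
label_embeddings = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.0]}
frame_embedding = [0.8, 0.2, 0.1]
predicted = zero_shot_classify(frame_embedding, label_embeddings)
```

For video you would run this per frame (or per sampled frame), with the label embeddings computed once up front.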

I'm looking forward to seeing foundation models improve for all the opportunities that will bring!

1 comment

Thanks, James, for your insights.

Your library looks nice.

You are right: some computer vision systems do have real-time requirements and do need to run on the edge. This is on the current Datasaurus roadmap. I would like to capture logs of API calls and foundation model responses so that they can later be used as training data for smaller models.
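
Roughly, the logging side could look like the sketch below. This is only an illustration of the idea and not how Datasaurus does or will implement it; the JSONL layout and the `log_call` / `load_training_pairs` names are assumptions.

```python
import json
import time
from typing import Any, List, Tuple


def log_call(log_path: str, request: Any, response: Any) -> None:
    """Append one API call and the model's response as a single JSON line."""
    record = {"timestamp": time.time(), "request": request, "response": response}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_training_pairs(log_path: str) -> List[Tuple[Any, Any]]:
    """Read the log back as (request, response) pairs for training a smaller model."""
    pairs = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                pairs.append((record["request"], record["response"]))
    return pairs
```

An append-only JSONL file keeps each call independent, so a partially written log is still readable up to the last complete line.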