Hacker News
Is supervised learning dead for computer vision?
38 points by baptiste1 972 days ago
Hey Everyone,

I’ve been diving deep into the world of computer vision recently, and I’ve gotta say, things are getting pretty exciting! I stumbled upon this vision-language model called LLaVA (https://github.com/haotian-liu/LLaVA), and it’s been nothing short of impressive.

In the past, if you wanted to teach a model to recognize the color of your car in an image, you’d have to go through the tedious process of training it from scratch. But now, with models like LLaVA, all you need to do is prompt it with a question like “What’s the color of the car?” and bam – you get your answer, zero-shot style.

It’s kind of like what we’ve seen in the NLP world. People aren’t training language models from the ground up anymore; they’re taking pre-trained models and fine-tuning them for their specific needs. And it looks like we’re headed in the same direction with computer vision.

Imagine being able to extract insights from images with just a simple text prompt. Need to step it up a notch? A bit of fine-tuning can do wonders, and from my experiments, it can even outperform models trained from scratch. It’s like getting the best of both worlds!

But here’s the real kicker: these foundational models, thanks to their extensive training on massive datasets, have an incredible grasp of image representations. This means you can fine-tune them with just a handful of examples, saving you the trouble of collecting thousands of images. Indeed, they can even learn from a single example (https://www.fast.ai/posts/2023-09-04-learning-jumps).

And let’s talk about development speed. By using text prompts to interact with your images, you can whip up a computer vision prototype in seconds. It’s fast, it’s efficient, and it’s changing the game.

So, what do you all think? Are we moving towards a future where foundational models take the lead in computer vision, or is there still a place for training models from scratch?

P.S. Shameless plug: I’ve been working on this open-source platform called Datasaurus (https://github.com/datasaurus-ai/datasaurus) that taps into the power of vision-language models. It’s all about helping engineers get the insights they need from images, fast. Just wanted to share some thoughts and start a conversation. Let’s talk about the future of computer vision!

10 comments

The places in which a vision model is deployed are different from those of a language model.

A vision model may be deployed on cameras without an internet connection, with data retrieved later; it may run on camera streams in a factory, or on sports broadcasts where you need low latency. In many cases, real-time -- or close to real-time -- performance is needed.

Fine-tuned models can deliver the requisite performance for vision tasks with relatively low computational power compared to the LLM equivalent. Vision weights are small relative to LLM weights.

LLMs are often deployed via API. This is practical for some vision applications (e.g. bulk processing), but for many use cases, not being able to run on the edge is a dealbreaker.

Foundation models certainly have a place.

CLIP, for example, works fast, and may be used for a task like classification on videos. Where I see opportunity right now is in using foundation models to train fine-tuned models. The foundation model acts as an automatic labeling tool, then you can use that model to get your dataset. (Disclosure: I co-maintain a Python package that lets you do this, Autodistill -- https://github.com/autodistill/autodistill).
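
To make the auto-labeling idea concrete, here is a minimal sketch in plain Python. It does not use the Autodistill API: `fake_labeler` is a hypothetical stand-in for a zero-shot foundation model, and the 0.8 confidence threshold is an assumed value, not a recommendation from this thread.

```python
from typing import Callable, List, Tuple

# Hypothetical labeler interface: takes an image path, returns (label, confidence).
Labeler = Callable[[str], Tuple[str, float]]

def auto_label(image_paths: List[str], labeler: Labeler, min_conf: float = 0.8):
    """Keep confident foundation-model predictions as labels; set the rest aside."""
    dataset, skipped = [], []
    for path in image_paths:
        label, conf = labeler(path)
        if conf >= min_conf:
            dataset.append((path, label))   # usable as training data for a small model
        else:
            skipped.append(path)            # left for manual labeling
    return dataset, skipped

def fake_labeler(path: str) -> Tuple[str, float]:
    # Toy rule so the sketch runs without a real model (assumption, not a real API).
    return ("car", 0.95) if "car" in path else ("unknown", 0.40)

dataset, skipped = auto_label(["car_01.jpg", "blurry_02.jpg"], fake_labeler)
```

The resulting `dataset` would then be fed to whatever small, fast model you plan to deploy on the edge.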

SAM (segmentation), CLIP (embeddings, classification), Grounding DINO (zero-shot object detection) in particular have a myriad of use cases, one of which is automated labeling.

I'm looking forward to seeing foundation models improve for all the opportunities that will bring!

Thanks, James, for your insights.

Your library looks nice.

You are right: some computer vision systems do have real-time requirements and need to run on the edge. Edge support is on the current Datasaurus roadmap. I would like to capture logs of API calls and foundational model responses so that they can later be used as training data for smaller models.
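
As a rough illustration of the logging idea (not how Datasaurus actually implements it), the sketch below appends each call to a JSON Lines file and reads it back as training pairs. The record fields and file name are assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path

def log_call(log_path: Path, image_id: str, prompt: str, response: str) -> None:
    """Append one model call as a single JSON line."""
    record = {"image_id": image_id, "prompt": prompt, "response": response}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_training_pairs(log_path: Path):
    """Read the log back as (image_id, response) pairs for training a smaller model."""
    with log_path.open(encoding="utf-8") as f:
        return [(r["image_id"], r["response"]) for r in map(json.loads, f)]

# Hypothetical usage with a temporary log file.
log_file = Path(tempfile.mkdtemp()) / "calls.jsonl"
log_call(log_file, "img_001", "What's the color of the car?", "red")
log_call(log_file, "img_002", "What's the color of the car?", "blue")
pairs = load_training_pairs(log_file)
```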

I think your question should be "Will pretrained generative models replace training application-specific models from scratch?" instead of your focus on supervised learning.

The model you mentioned uses supervised learning, and fine-tuning also requires supervised learning (i.e. the text captions associated with the images).

Yes, true. Indeed, your phrasing is better! I am not sure however if I can change the title now. I will keep your comment in mind for the future.
Both approaches will co-exist for the foreseeable future. There's plenty of applications where foundation models trained on random internet images don't help much due to the specialist (or confidential) nature of the imagery and there's not so much point in training specialist foundation models due to the small amounts of available data.
Thanks for your comment. I agree, I think both methods are quite complementary.
I can understand that this works for cats or cars, but how does this work for images that are not in any training set yet? E.g. highly specialized images like x-ray pictures, or astronomy pictures?
As I understand it, the point is that these models, while they are _trained_ on identifying cats or cars, have seen so much variation during training that they have internalised very different concepts to help come up with "it's a cat". The idea then is to take all of these pre-trained weights that let you build this classifier, but then add your own custom head on top of this network. This saves you doing a huge amount of training for what is essentially feature extraction - that part is already done. All you need to do is add a bit more training that works out how to use these learnt features. I could be way off the mark, but that's how I understand it.
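
A toy sketch of that "frozen features plus custom head" idea, in plain Python. Everything here is an assumption for illustration: `frozen_features` stands in for a pretrained backbone, and the head is a tiny logistic regression trained with plain gradient steps on made-up data.

```python
import math

def frozen_features(x):
    # Stand-in for a pretrained backbone; it is never updated during training.
    return [x[0] + x[1], x[0] - x[1]]

def train_head(samples, labels, lr=0.1, epochs=300):
    """Train only the head (w, b) on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_features(x)
            p = 1.0 / (1.0 + math.exp(-(w[0] * f[0] + w[1] * f[1] + b)))
            g = p - y                      # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    f = frozen_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

# Four made-up examples: class 1 when the first value is larger than the second.
samples = [(2, 1), (3, 1), (1, 2), (1, 3)]
labels = [1, 1, 0, 0]
w, b = train_head(samples, labels)
```

Only `w` and `b` change during training, which is why a handful of examples can be enough when the features are already good.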
Yes, your understanding is correct. However, instead of adding a head on top of the network, most fine-tuning is currently done with LoRA (https://github.com/microsoft/LoRA). This adds small low-rank matrices to layers of your model; these are then trained on your training data while the rest of the model's weights stay frozen.
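
For readers new to LoRA, here is a minimal sketch of the core idea in plain Python: the frozen weight W is combined with a trainable low-rank product B·A. The matrix sizes, values, and helper names are assumptions chosen for illustration; this is not the API of the microsoft/LoRA package.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, scale=1.0):
    """Return W + scale * (B @ A). W stays frozen; only A and B would be trained."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 4x4 weight (identity as a toy example) and a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1, 0.2, 0.3, 0.4]]            # shape (r=1, d_in=4)
B = [[0.0], [0.0], [0.0], [0.0]]      # shape (d_out=4, r=1), zero at the start

W_eff = lora_effective_weight(W, A, B)   # equals W while B is still all zeros

frozen_params = 4 * 4                    # entries of W, not updated
trainable_params = 4 * 1 + 1 * 4         # entries of B and A
```

With B starting at zero, the adapted model initially behaves exactly like the pretrained one, and training only touches the small A and B matrices.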
Foundational models are generally trained on internet-scale data. They have seen billions of images, so they would have seen some medical images, for example from public datasets or textbooks. However, they may indeed not be specialized to your use case. You could still fine-tune the model with a couple of examples to tailor it to what you need. Having a foundation model does not exclude training, and your data could still be valuable. Indeed, you could achieve better performance by fine-tuning the larger model than by using your training data alone to train a model from scratch.

Also for the medical domain, I think vision-text segmentation models like SEEM (https://github.com/UX-Decoder/Segment-Everything-Everywhere-...) are really cool. You could for example ask “Where is the tumor located on that image?” and then the tumor is highlighted in the picture.

i had the same question, but re mri
My feeling is this pre-trained stuff won't work with electron microscopy, infrared cameras, or other types of specialized data+targets that aren't in the generic databases.
I think that is a great question and I think you are right. Generative models trained without supervision will replace discriminative models trained with supervision. But I think there are lots and lots of applications for generative models fine-tuned with labeled datasets. For example, I know of a construction company that needs to detect people wearing a certain type of worker vest in images. Existing models have no problem detecting people in images, but generally can't distinguish between these kinds of worker vests and normal clothes. Another company needs to detect loose screw heads in engine blocks. Training models from scratch would require datasets that are too large, but fine-tuning existing models using perhaps hundreds of images should be doable. All the tech is there; it's just not packaged well enough to be usable by normal developers. There are massive business opportunities for any company that can solve this.
Thank you for your comment. Indeed, you are right: not every company has terabytes of data to train their model. I like your example: "Another company needs to detect loose screw heads in engine blocks."

I actually got the idea for Datasaurus because of a similar problem. My brother wanted to check if sheets of metal were bent and needed to be rejected in a production line setting. However, he did not have any data and could maybe annotate a couple of images manually but not create a full dataset. We tested the fine-tuning approach and he was able to have good results in a couple of minutes.

That’s why I think this could be quite valuable and I decided to package it into an open-source application.

Which foundational model did you fine-tune with few images? Just curious. I personally believe a language interface or language conditioning is not very relevant, or is even harmful, for many downstream CV applications. In your case, you don't need to ask whether the metal is bent through a language interface; there could be a hundred ways to ask this question, and the outputs would be slightly different for each one. That's an unwanted instability. I feel that conditioning on inputs based on a few examples would be much more relevant, i.e. instead of conditioning with text embeddings, why not condition with the embeddings of these 3 images and their labels?
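
One simple way to read "condition with embeddings of these 3 images and their labels" is nearest-example classification in embedding space. The sketch below assumes the embeddings already come from some image encoder; the three-dimensional vectors and the "flat"/"bent" labels are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(query, support):
    """Return the label of the labeled example most similar to the query embedding."""
    return max(support, key=lambda item: cosine(query, item[0]))[1]

# Three labeled examples with hypothetical embeddings (no text prompt involved).
support = [
    ([0.9, 0.1, 0.0], "flat"),
    ([0.8, 0.2, 0.1], "flat"),
    ([0.1, 0.9, 0.3], "bent"),
]
label = classify([0.2, 0.8, 0.2], support)
```

Because no prompt is involved, the result does not depend on how a question happens to be phrased.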
I used LLaVA. Unfortunately, I signed an NDA :( so I cannot share the code and the data is private. We fine-tuned it with example images, labels, and text prompts. We also tried in-context learning. Indeed, the prompt was static but we could do data augmentation and provide a series of equivalent prompts. We just used the prompt that gave us the best performance during initial model testing with in-context learning. I am unsure if the existence of equivalent prompts creates instability, because sentences with the same meaning should be quite close in the latent space of the foundation model, so it understands them in a similar manner.
We were always in this regime. In general, it has been far more effective to take a proven model and fine-tune it. It was true five years ago taking a convnet out of a model zoo to fine tune and it’s true now.

If you can achieve the moonshot of gathering, generating, annotating enough data with the right distribution to train from scratch, and you use a SOTA bag of tricks to regularise, you might do better.

Bear in mind fine tuning is literally just more pre-training. Starting with a trained model is like starting with an incredibly well-initialised network.

Yes, true the fine-tuning is not new and indeed I also view it as "starting with an incredibly well-initialized network"

However, the promptable aspects of those vision models are completely new. You can define your tasks at runtime and steer the model behavior. I think this makes it easier and faster to get insights from your images. Lastly, those models are trained on many different tasks, whereas previous models were general classifiers that could then be trained on a specific domain. This allows them, for example, to be reused across an organisation and saves you from creating multiple task-specific models.

I’m guessing for more embedded, low power, or real time applications you’d still need to train a model?

I’d imagine you wouldn’t have the resources to run a souped up foundation model?

Thanks for your question. You are right, current vision-language foundation models are quite heavy. However, for example in NLP there are some works on smaller foundation models. In addition, you could also use a foundational model to help train your smaller model or label more data.
Betteridge's law of headlines - the answer is "no"

Pretrained LVMs can do many things; they are a powerful tool in our toolbox. But they are limited to the tasks they were pretrained on, and may come with subpar accuracy at scale or unknown biases that raise PR red flags.

LVMs also require expensive hardware to run, they are slow, and can be expensive to fine-tune.

I've worked on production vision classification models that run on cheap CPUs and even Raspberry Pis. For large-scale companies, the difference can be $10k+ vs < $10 monthly cloud bills.

The other thing to consider is that collecting data for supervised learning can be fairly cheap. $5k spent on manual labeling is cheap compared to an engineer, and more importantly, that data can become a strategic IP advantage (there's no moat around open-source LVM applications).

If we have a use-case that LVMs support, it can be a good way to get to market faster. Once proven, I would seriously look at using the LVM plus human review to build a dataset for supervised training a cheap/fast/simple model from scratch.

> It’s kind of like what we’ve seen in the NLP world. People aren’t training language models from the ground up anymore; they’re taking pre-trained models and fine-tuning them for their specific needs.

This is false. Everything I wrote applies to LLMs.

Worked on computer vision. I agree with all this!
Thanks for your comment.

I did not know about "Betteridge's law of headlines", quite interesting. Thanks for sharing :)

You raise some interesting points.

1) Safety: It is true that LVMs and LLMs have unknown biases and could potentially create unsafe content. However, this is not necessarily unique to them, for example, Google had the same problem with their supervised learning model https://www.theverge.com/2018/1/12/16882408/google-racist-go.... It all depends on the original data. I believe we need systems on top of our models to ensure safety. It is also possible to restrict the output domain of our models (https://github.com/guidance-ai/guidance). Instead of allowing our LVMs to output any words, we could restrict it to only being able to answer "red, green, blue..." when giving the color of a car.

2) Cost: You are right: right now, LVMs are quite expensive to run. As you said, they are a great way to go to market faster, but they cannot run on low-cost hardware for the moment. However, they could help with training those smaller models. Indeed, we see in the NLP domain that a lot of smaller models are trained on data created with GPT models. You can still distill the knowledge of your LVMs into a custom smaller model that can run on embedded devices. The advantage is that you can use your LVMs to generate data when it is scarce and use them as a fallback when your smaller model is uncertain of the answer.

3) Labelling data: I don't think labeling data is necessarily cheap. First, you have to collect the data, which, depending on the frequency of your events, could take months of monitoring if you want to build a large-scale dataset. Second, not all labeling is cheap. I worked at a semiconductor company where labeled data was scarce, as labeling required expert knowledge and could only be done by experienced employees. Indeed, not all labelling can be done externally.
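
To illustrate the "restrict the output domain" idea from point 1, here is a minimal sketch in plain Python. It does not use the guidance library; it simply assumes the model can give a score for each candidate answer (the scores below are made up) and picks the best answer from an allowed set.

```python
def constrained_answer(scores, allowed):
    """Return the allowed answer with the highest score, ignoring everything else."""
    candidates = {word: s for word, s in scores.items() if word in allowed}
    if not candidates:
        raise ValueError("model gave no score for any allowed answer")
    return max(candidates, key=candidates.get)

# Hypothetical scores for the question "What's the color of the car?"
scores = {"crimson-ish": 2.1, "red": 1.7, "blue": -0.3, "I am not sure": 0.9}
answer = constrained_answer(scores, allowed={"red", "green", "blue"})
```

Even though the unconstrained top answer would be "crimson-ish", the result is forced into the small vocabulary the downstream system expects.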

However, both approaches are indeed complementary and I think systems that will work the best will rely on both.
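
One way the two approaches could work together is a cascade, as mentioned in point 2: a cheap model answers when it is confident, and the large model is the fallback. The sketch below is only an illustration; both models are toy stand-ins and the 0.9 threshold is an assumed value.

```python
def cascade(image, small_model, large_model, min_conf=0.9):
    """Use the cheap model when it is confident, otherwise fall back to the large one."""
    label, conf = small_model(image)
    if conf >= min_conf:
        return label, "small"
    return large_model(image), "large"

# Toy stand-ins so the sketch runs without real models (assumptions).
def small_model(image):
    return ("ok", 0.97) if image == "clear.jpg" else ("ok", 0.55)

def large_model(image):
    return "defect"

easy = cascade("clear.jpg", small_model, large_model)
hard = cascade("ambiguous.jpg", small_model, large_model)
```

In this setup most traffic stays on the cheap model, and the expensive calls double as new labeled data for retraining it.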

Thanks again for the thought-provoking discussion. I hope this answers some of the concerns you raised.

Is this true even if you want to identify something in an image and get its pixel coordinates?

Like say a pickleball.

Yes, you can. The model that I was talking about, LLaVA, only outputs text, but other models such as SEEM (https://github.com/UX-Decoder/Segment-Everything-Everywhere-...) output a segmentation map. You could prompt the model with "Where is the pickleball in the image?" and get a segmentation map that you could then use to compute its center. Please let me know if you would be interested in having SEEM available in Datasaurus.
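
For the last step, computing a center from a segmentation map, a minimal sketch could look like this. It assumes the model output has already been converted to a binary mask given as a list of rows; the mask below is made up and does not come from SEEM.

```python
def mask_center(mask):
    """Return the (x, y) centroid of the nonzero pixels in a 2D binary mask, or None if empty."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Toy 4x5 mask with a 2x2 blob standing in for the pickleball.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
center = mask_center(mask)
```

The result is in pixel coordinates, with x counted along columns and y along rows.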