Hacker News
by bjourne 972 days ago
I think that is a great question, and I think you are right: generative models trained without supervision will replace discriminative models trained with supervision. But I think there are lots and lots of applications for generative models fine-tuned with labeled datasets. For example, I know of a construction company that needs to detect people wearing a certain type of worker vest in images. Existing models have no problem detecting people in images, but generally can't distinguish between this kind of worker vest and normal clothes. Another company needs to detect loose screw heads in engine blocks. Training models from scratch would require datasets that are far too large, but fine-tuning existing models using perhaps hundreds of images should be doable. All the tech is there; it's just not packaged well enough to be usable by normal developers. There are massive business opportunities for any company that can solve this.

Thank you for your comment. Indeed, you are right: not every company has terabytes of data to train their model. I like your example: "Another company needs to detect loose screw heads in engine blocks."

I actually got the idea for Datasaurus because of a similar problem. My brother wanted to check whether sheets of metal were bent and needed to be rejected on a production line. However, he did not have any data; he could maybe annotate a couple of images manually, but not create a full dataset. We tested the fine-tuning approach, and he got good results within a couple of minutes.

That’s why I think this could be quite valuable and I decided to package it into an open-source application.

Which foundation model did you fine-tune with a few images? Just curious. I personally believe a language interface or language conditioning is not very relevant, or is even harmful, for many downstream CV applications. In your case, you don't need to ask whether the metal is bent through a language interface; there could be a hundred ways to ask that question, and the outputs would be slightly different for each one. That's unwanted instability. I feel that conditioning on inputs based on a few examples would be much more relevant, i.e. instead of conditioning on text embeddings, why not condition on the embeddings of these 3 images and their labels?
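One common way to condition on a few labeled images rather than on text is a nearest-prototype classifier over image embeddings. The sketch below is only an illustration of that idea, not anything used in this thread: the three-dimensional vectors are made-up stand-ins for the output of a real vision encoder, and the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    # Element-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prototypes(support):
    # Group the labeled example embeddings by label and average each group.
    by_label = {}
    for embedding, label in support:
        by_label.setdefault(label, []).append(embedding)
    return {label: mean_vector(embs) for label, embs in by_label.items()}

def classify(query, prototypes):
    # Return the label whose prototype is most similar to the query embedding.
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# Toy embeddings (assumption): in practice these would come from an image encoder.
support = [
    ([0.9, 0.1, 0.0], "bent"),
    ([0.8, 0.2, 0.1], "bent"),
    ([0.1, 0.9, 0.2], "flat"),
]
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.3, 0.1], prototypes)
print(prediction)
```

With this setup there is no prompt wording to vary, so the only inputs are the example embeddings and their labels.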
I used LLaVA. Unfortunately, I signed an NDA :( so I cannot share the code, and the data is private. We fine-tuned it with example images, labels, and text prompts. We also tried in-context learning. Indeed, the prompt was static, but we could do data augmentation and provide a series of equivalent prompts. We just used the prompt that gave us the best performance during initial model testing with in-context learning. I am unsure whether the existence of equivalent prompts creates instability, because sentences with the same meaning should be quite close in the latent space of the foundation model, so it should understand them in a similar manner.
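Whether equivalent prompts cause instability can be checked empirically by asking the same question several ways and measuring how often the answers agree. The following is a rough sketch under assumptions: `toy_predict` is a hypothetical stub standing in for a real model call (it is not LLaVA's API), and the prompts and file name are invented for illustration.

```python
from collections import Counter

def prompt_agreement(predict, image, prompts):
    # Ask the same question with each prompt and collect the answers.
    answers = [predict(image, prompt) for prompt in prompts]
    # The agreement rate is the share of prompts that gave the most common answer.
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

def toy_predict(image, prompt):
    # Hypothetical stub: a real implementation would query a vision-language model.
    # It deliberately answers differently for one phrasing to show disagreement.
    return "unsure" if "deformed" in prompt else "bent"

prompts = [
    "Is the metal sheet bent?",
    "Does this sheet of metal show a bend?",
    "Is this sheet deformed?",
]
answer, agreement = prompt_agreement(toy_predict, "sheet_001.png", prompts)
print(answer, agreement)
```

An agreement rate near 1.0 across a validation set would support the view that paraphrases are handled similarly; a lower rate would point to the instability raised above.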