Hacker News new | ask | show | jobs
by Kubuxu 895 days ago
You could consider something like BLIP2. There are multiple ways you could use it: embed images and match them against embeddings of text descriptors, train custom embedding of descriptors, or a classifier on top of the embedding (linear layer on top of the image embedding network).

The approaches increase in complexity. It also allows for dataset bootstrap:

Let's say you want to classify cats by breed. You could start by embedding images and text descriptors and distance-matching the embedded descriptors to the images. This gives you a dataset that might be 90% correct; you can then clean it up, which would be easier to do than manually labelling it. Based on that improved dataset, you can train a custom embedding for the labels or a classification layer on top of the image embedding network.

1 comments

thank you, will look into BLIP2.