by Kubuxu
895 days ago
You could consider something like BLIP2. There are several ways to use it, in increasing order of complexity: embed the images and match them against embeddings of text descriptors; train a custom embedding for the descriptors; or train a classifier on top of the embeddings (a linear layer on top of the image embedding network). It also lets you bootstrap a dataset. Say you want to classify cats by breed: you could start by embedding the images and the text descriptors, then assign each image the descriptor whose embedding is closest. This gives you a dataset that might be 90% correct, and cleaning it up is easier than labelling everything manually.
Based on that improved dataset, you can train a custom embedding for the labels or a classification layer on top of the image embedding network.
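
A minimal sketch of the bootstrap step, in case it helps. It assumes you already have image and descriptor embeddings from whatever model you pick; the three-dimensional toy vectors, the file names, the `bootstrap_labels` helper, and the `min_margin` threshold are all made up for illustration and are not part of BLIP2 or any library. Each image gets the descriptor with the highest cosine similarity, and images where the top two descriptors score almost the same are flagged for manual review.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def bootstrap_labels(image_embs, label_embs, min_margin=0.05):
    """Give each image its closest label and flag ambiguous matches.

    min_margin is a hypothetical threshold: if the best and second-best
    similarities differ by less than this, the image is marked for review.
    """
    results = {}
    for name, img in image_embs.items():
        scored = sorted(
            ((cosine(img, emb), label) for label, emb in label_embs.items()),
            reverse=True,
        )
        best_score, best_label = scored[0]
        margin = best_score - scored[1][0] if len(scored) > 1 else float("inf")
        results[name] = (best_label, margin < min_margin)
    return results


# Toy stand-ins for real embeddings (assumed, not produced by any model).
label_embs = {
    "siamese": [1.0, 0.0, 0.0],
    "maine coon": [0.0, 1.0, 0.0],
}
image_embs = {
    "img_001.jpg": [0.9, 0.1, 0.0],
    "img_002.jpg": [0.1, 0.8, 0.1],
    "img_003.jpg": [0.5, 0.5, 0.1],  # equally close to both labels
}

labels = bootstrap_labels(image_embs, label_embs)
for name, (label, needs_review) in labels.items():
    print(name, label, "REVIEW" if needs_review else "ok")
```

The flagged images are where the clean-up effort would go first; the rest form the rough dataset you would then use to train the custom label embedding or the linear layer.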