by jotaf 1752 days ago
Thanks, but I think that lucb1e's confusion was probably the same as mine -- given pretrained CLIP features, how is this translated to zero-shot tracking?

Are initial bounding boxes given as usual, or are objects of interest created automagically?

Or are they tracked just from text descriptions?

Lots of questions after reading the post :)

1 comment

It uses an object detection model to get the bounding boxes. In our example code[1] we used one from Roboflow Universe[2], but you should be able to use any object detection model. It then sends a crop of each detected box to CLIP to get the feature vector that Deep SORT uses to tell instances apart and track them across frames.

This is in contrast to the original Deep SORT[3], which requires you to train a second, custom "deep appearance descriptor" model for the tracker to use.
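
To make the flow concrete, here is a minimal sketch of the detect, crop, embed, associate loop. It is not the code from the repo[1]. `embed_crop` is a hypothetical stand-in for the CLIP image encoder (a plain intensity histogram, so the sketch runs without any model), the boxes are hard-coded instead of coming from a detector, and `associate` is a simplified greedy matcher, whereas Deep SORT also uses Kalman-filter motion gating and a proper assignment step. All function names and the `max_distance` threshold are assumptions made for illustration.

```python
import math

def crop(frame, box):
    """Cut box = (x1, y1, x2, y2) out of a frame stored as rows of pixels."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]

def embed_crop(patch, bins=4):
    """Hypothetical stand-in for the CLIP image encoder.

    Returns a unit-length intensity histogram. In the real pipeline this
    would be the CLIP feature vector of the cropped detection.
    """
    hist = [0.0] * bins
    for row in patch:
        for px in row:
            hist[min(px * bins // 256, bins - 1)] += 1.0
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def associate(tracks, detections, max_distance=0.3):
    """Greedily match detection features to existing track features.

    tracks: {track_id: feature}, detections: list of features.
    Returns {detection_index: track_id}. Simplified on purpose: no motion
    model and no optimal assignment.
    """
    pairs = sorted(
        (cosine_distance(feat, det), tid, i)
        for tid, feat in tracks.items()
        for i, det in enumerate(detections)
    )
    matches, used_tracks = {}, set()
    for dist, tid, i in pairs:
        if dist > max_distance:
            break  # remaining pairs are even further apart
        if tid in used_tracks or i in matches:
            continue
        matches[i] = tid
        used_tracks.add(tid)
    return matches

# Toy frames: a dark object and a bright object that swap sides.
frame1 = [[10] * 4 + [240] * 4 for _ in range(4)]
frame2 = [[240] * 4 + [10] * 4 for _ in range(4)]
boxes = [(0, 0, 4, 4), (4, 0, 8, 4)]  # would come from the detector

# Frame 1: start one track per detection.
tracks = {tid: embed_crop(crop(frame1, box))
          for tid, box in enumerate(boxes, start=1)}

# Frame 2: embed the new detections and match them by appearance.
detections = [embed_crop(crop(frame2, box)) for box in boxes]
matches = associate(tracks, detections)
# detection 0 (left, now bright) -> track 2; detection 1 (right, now dark) -> track 1
```

The point of the sketch is that the appearance feature alone is enough to keep identities straight when the objects swap places, which is the job the trained "deep appearance descriptor" does in the original Deep SORT and that CLIP takes over here.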

[1] https://github.com/roboflow-ai/zero-shot-object-tracking

[2] https://universe.roboflow.com

[3] https://github.com/nwojke/deep_sort