by jimfleming 2719 days ago
I have experience designing these kinds of computer vision systems for other applications. What you describe is very doable. I'd start by isolating each component you want to identify and track, then break that list down into multiple models that run in multiple passes over the video. While doing this, try to group tasks that use the same inputs, even if their outputs differ. This will help pin down exactly what input each model needs. For example, some models may only need to see the top left of the video, which simplifies things. You'll probably also want a context model that determines which frames the other models should see.
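
To make that concrete, here is a rough sketch of the structure I mean, not a real implementation. Everything in it is made up for illustration: the names (Task, group_by_region, run_pipeline), the (top, left, height, width) region convention, and the toy "models", which are plain functions standing in for trained networks. Frames are nested lists so it runs without any libraries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Frame = List[List[int]]              # toy grayscale frame: rows of pixel values
Region = Tuple[int, int, int, int]   # hypothetical convention: (top, left, height, width)


@dataclass
class Task:
    name: str                        # what this model identifies or tracks
    region: Region                   # the part of the frame it needs to see
    model: Callable[[Frame], object] # stand-in for a trained model


def crop(frame: Frame, region: Region) -> Frame:
    """Cut the given region out of a frame."""
    top, left, height, width = region
    return [row[left:left + width] for row in frame[top:top + height]]


def group_by_region(tasks: Sequence[Task]) -> Dict[Region, List[Task]]:
    """Group tasks that share the same input, even if their outputs differ."""
    groups: Dict[Region, List[Task]] = {}
    for task in tasks:
        groups.setdefault(task.region, []).append(task)
    return groups


def run_pipeline(
    frames: Sequence[Frame],
    is_relevant: Callable[[Frame], bool],
    tasks: Sequence[Task],
) -> Dict[str, List[Tuple[int, object]]]:
    """Run every task on the frames that the context model lets through.

    Returns, per task name, a list of (frame_index, output) pairs.
    """
    groups = group_by_region(tasks)
    results: Dict[str, List[Tuple[int, object]]] = {t.name: [] for t in tasks}
    for index, frame in enumerate(frames):
        # Context model: decide whether the other models should see this frame.
        if not is_relevant(frame):
            continue
        for region, group in groups.items():
            patch = crop(frame, region)  # crop once per shared input
            for task in group:
                results[task.name].append((index, task.model(patch)))
    return results


# Toy stand-ins for real models (assumptions, for illustration only).
def mean_brightness(patch: Frame) -> float:
    pixels = [value for row in patch for value in row]
    return sum(pixels) / len(pixels) if pixels else 0.0


def has_any_signal(frame: Frame) -> bool:
    return any(value > 0 for row in frame for value in row)
```

Keeping each task as its own entry means you can swap or retrain one model without touching the rest, and the grouping step shows which models could later share features.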

Avoid going all-in on a single end-to-end deep model right away. There are lots of details to work out, and it will be easier to iterate on separate components. You may eventually end up with a single model to leverage feature sharing and improve results, and the details you worked out along the way will still be relevant.