| At this stage, the sibling comment's suggestion to use GCP is a pretty solid recommendation. You can use Google for labelling (Mechanical Turk style) and AutoML Vision to train your model. It's going to be a bit pricey, but cheaper than your time to do the equivalent, and it will give you an educated guess at how much work it would take to beat it. I think it costs about $100 to train a cloud vision model (not including labelling). You can also try the API for free to see how well Google does at finding people; their off-the-shelf models are better than what you can get publicly. https://cloud.google.com/vision/automl/docs/ You can also try exploiting other things. Is your scene static? Try using frame differences as a feature. If it's a fixed environment, you should get a boost from fine-tuning a model compared with using some general person detector. COCO-pretrained models should be quite good at finding people out of the box. I wrote my own labelling tool specifically for YOLO which you may find useful (i.e. you label your data and export to a train-ready format): https://github.com/jveitchmichaelis/deeplabel People who are not experienced are usually terrible at tagging images. They're not consistent, they miss objects, and they don't understand why that's an issue. It will be faster to pay an "expert" service like Mechanical Turk, or to do it yourself. Basically, a lot of your questions are open research problems. How much data do you need? Not a clue. It depends on how your model is failing, which is always worth checking anyway. Figure out what the model is bad at and try to improve it; it should be doable to figure out where that 25% is going. You should do better with a model like Faster R-CNN or its ilk. AutoML will do something like this, and you can try Facebook's Detectron2 toolkit or the TensorFlow Object Detection API. Detecting unique people is a hard problem, by the way (e.g. two people versus the same person detected twice). 
You're better off just using an established method like RFID tags for presence/absence. Another sibling made a great point: don't detect people; instead, train a model to output the number of people in the frame. This is how ML is applied to camera trap data with animals. In your case you can reduce this to a binary classification problem: >= 2 people means a positive output. |
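
For what it's worth, here is a minimal sketch of that last reduction, assuming you already have a people count per frame from your labelling pass. The function name, the `threshold` parameter, and the example counts are all made up for illustration; they don't come from any particular toolkit.

```python
from typing import Iterable, List


def counts_to_binary_labels(counts: Iterable[int], threshold: int = 2) -> List[int]:
    """Map per-frame people counts to 1 (count >= threshold) or 0 (otherwise)."""
    return [1 if count >= threshold else 0 for count in counts]


# Hypothetical annotations: the number of people a labeller counted in each frame.
frame_counts = [0, 1, 2, 3, 1, 0, 2]

# 1 means "two or more people in the frame", 0 means "fewer than two".
labels = counts_to_binary_labels(frame_counts)

# Fraction of positive frames, worth checking because a heavily
# imbalanced dataset changes how you should train and evaluate.
positive_fraction = sum(labels) / len(labels)

print(labels)
print(f"positive fraction: {positive_fraction:.2f}")
```

You would then train an ordinary image classifier on the (frame, label) pairs instead of a detector, which also sidesteps the "same person detected twice" problem.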