By analyzing a video or picture (or whatever other sensor data you give it). It is a pretty straightforward classification problem, and imagine classification has been a major portion of ML research for decades.
There is no single video or audio that covers the life of a call, and likely never will be?
Additionally, body cams have extremely limited fields of view compared to humans. They’re fine for covering what is going on in the immediate vicinity, but useless for describing overall context.
That said, cops are notoriously terrible report writers, so meh.
It is a reference to specific testing of general models. Specific models can probably do this pretty easily, but genearl vision models acting on video still struggle. Not sure why this is controversial or down voted. Go and test it for yourself.