Why? Surveillance and voice recognition have been possible for a long time, and so has low-cost triggering on phrases without full analysis. There's no point using a system like this for real-time processing if you can capture the stream for batch processing later. This system even removed some transformer layers, sacrificing accuracy for lower latency.
Basically, anyone who wanted to do surveillance like this has already been doing it, with public or secret implementations. This won't massively help anyone.
I disagree. I think the fact that companies like Roblox still struggle with moderating discussions, and are doing research in this area, shows that it's not as solved a problem as you say it is.