I wonder if limiting the set of commands would reduce misinterpretations. I once asked Google Maps a question about the trip and it played a very obscure song. I asked it to stop and it played another obscure song.
As far as I understand, it uses a general-purpose STT (which tries to transcribe everything, unlike, say, Picovoice, which limits interpretation to only a few commands) followed by intent recognition. If "stop" is transcribed correctly, the intent stage is unlikely to map it to anything other than the matching intent (even a bag-of-words classifier can get that right), so the mistake most likely happens in the STT. And since limiting the commands leaves the STT the same, it probably won't change a thing.
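
To illustrate the point about the intent stage, here is a minimal sketch of a bag-of-words intent classifier. The intent names, example phrases, and the `classify` function are all made up for illustration; this is not how Google Maps or Picovoice actually work. It only shows that a correct transcript of "stop" is trivially easy to map, while a wrong transcript sends even a perfect classifier to the wrong intent.

```python
from collections import Counter

# Hypothetical intents and example phrases (assumptions, not from any real product).
TRAINING = {
    "stop": ["stop", "stop the music", "please stop playing"],
    "play_music": ["play a song", "play some music", "play the song"],
    "trip_info": ["how long is the trip", "when do we arrive", "how far is the trip"],
}


def bag(text):
    """Turn a transcript into a bag of lowercase words."""
    return Counter(text.lower().split())


# One merged bag of words per intent, built from its example phrases.
INTENT_BAGS = {
    intent: sum((bag(phrase) for phrase in phrases), Counter())
    for intent, phrases in TRAINING.items()
}


def classify(transcript):
    """Return the intent whose bag overlaps most with the transcript."""
    words = bag(transcript)
    scores = {
        intent: sum(min(count, vocab[word]) for word, count in words.items())
        for intent, vocab in INTENT_BAGS.items()
    }
    best = max(scores, key=scores.get)
    # No overlapping word at all means no intent matched.
    return best if scores[best] > 0 else "unknown"


print(classify("stop"))  # correct transcript
print(classify("song"))  # what the classifier sees if the STT mishears the word
```

With a correct transcript the first call lands on the "stop" intent; if the STT hands over a different word, the classifier has no way to recover, which is why restricting the intents alone probably wouldn't help.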