There's an ocean of difference between optimizing for a single wakeword and the class of models that are taking off today. I'm excited for more on-board processing, because it will mean less dependency on the cloud.
I'm not going to argue that there's no difference between going from 0 -> 1 and going from 1 -> 10, or in this case, from 1.5B parameters (Whisper large) to a reported 1.7T (GPT-4, a figure OpenAI has never confirmed). But it's not like we don't know how to do it, so it won't take 10 years to get there.
Siri’s wake word stuff is also terrible. She constantly gets activated whenever my Apple Watch is near running water, frying food, or anything else that makes a white-noise-type sound.