It would be implemented like auto-completion. The model would be called repeatedly with the existing input, extended with the user's uncommitted input and a prompt asking it to decide whether it should act.
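A rough sketch of that polling loop, just to make the idea concrete. Everything named here is an assumption for illustration: `ask_model` stands in for whatever call reaches the model, the `[uncommitted]` marker and the YES/NO prompt wording are made up, and `stub_model` is a fake model so the example runs on its own.

```python
from typing import Callable, List, Optional

# Hypothetical prompt wording; the idea only requires that the model is asked to decide.
DECISION_PROMPT = "Should the assistant act now? Answer YES or NO."


def build_query(context: str, uncommitted: str) -> str:
    """Extend the existing input with the user's uncommitted text and the decision prompt."""
    return f"{context}\n[uncommitted] {uncommitted}\n{DECISION_PROMPT}"


def poll(context: str, snapshots: List[str], ask_model: Callable[[str], str]) -> Optional[int]:
    """Call the model once per snapshot of the draft.

    Returns the index of the first snapshot for which the model answers YES,
    or None if it never decides to act.
    """
    for i, draft in enumerate(snapshots):
        answer = ask_model(build_query(context, draft))
        if answer.strip().upper().startswith("YES"):
            return i
    return None


def stub_model(query: str) -> str:
    """Fake model for the example: acts only once the destination is complete."""
    return "YES" if "Hawaii" in query else "NO"


snapshots = ["I need a plane ticket to Ha", "I need a plane ticket to Hawaii"]
first_action = poll("", snapshots, stub_model)
```

With the stub, the loop stays quiet on the partial input and acts on the second snapshot. In practice the snapshots would come from keystroke events rather than a list, and calling the model on every one of them is the cost the timing idea is meant to reduce.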
A solution could be a model trained on the exact timeline of text being typed, so that it can predict how long the user will take to type the predicted text.
e.g. "I need a plane ticket to Ha" - 730ms -> "I need a plane ticket to Hawaii"
The model would detect deviations from the estimated time and invoke the main LLM when they occur. This could work for spoken word too; it would just be trained on real speech instead of typing.
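The deviation check could look something like the sketch below. This is only an illustration under assumptions: `Prediction`, `deviates`, and the 50% `tolerance` are invented here, the timing model itself is not shown, and treating a text mismatch as a deviation is an extra guess on top of the timing check described above.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    completion: str     # full text the timing model expects the user to end up with
    expected_ms: float  # predicted time to finish typing it, from when the prediction was made


def deviates(pred: Prediction, typed: str, elapsed_ms: float, tolerance: float = 0.5) -> bool:
    """Return True when the user is off the predicted track and the main LLM should be invoked.

    `tolerance` is a hypothetical slack factor: 0.5 allows 50% timing error.
    """
    if not pred.completion.startswith(typed):
        # The text has diverged from the predicted completion.
        return True
    if typed == pred.completion:
        # Finished as predicted: deviation only if the timing was far off.
        return abs(elapsed_ms - pred.expected_ms) > tolerance * pred.expected_ms
    # Still typing: deviation only once the user is clearly overdue.
    return elapsed_ms > pred.expected_ms * (1 + tolerance)


pred = Prediction("I need a plane ticket to Hawaii", 730.0)
on_track = deviates(pred, "I need a plane ticket to Haw", 400.0)
stalled = deviates(pred, "I need a plane ticket to Ha", 2000.0)
diverged = deviates(pred, "I need a plane ticket to Han", 300.0)
```

Here only the stalled and diverged cases would trigger a call to the main LLM. For speech, the same check would apply with `typed` replaced by the transcript so far and the prediction coming from a model trained on speech timing.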