Digital data is all 1s and 0s, whether it encodes words, sounds, or pictures. Why do you think transformers only work for predicting words, when they're already successfully being used for other applications as well?
I think that, much as a basic Turing machine definition allows computation on a variety of substrates, some kind of intelligence can be created with a whole class of implementations, transformers included. Indeed, the video and image input of LLMs is one of the most exciting use cases.