How could an AGI/ASI exist that isn't a next token predictor? It has to be able to generate a next token in a string of text. Otherwise it can't communicate.
It could be a diffusion model with a latent model of what needs to be said, one that generates the whole message or conversation at once, refining it progressively.
Although I love how next token prediction leads to text showing up gradually, accompanied, in the case of local models, by the modulated coil whine of my GPU. It's how the 80s showed us intelligent computers should communicate.
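To make the contrast concrete, here is a toy sketch of the two decoding orders. It is not a real model: both functions are handed the finished message (`target`) as a stand-in for whatever a trained network would predict, and the names `autoregressive`, `iterative_unmask`, `per_step` and `MASK` are made up for illustration. Real text diffusion models differ in many details; this only shows the order in which text becomes visible.

```python
import random

MASK = "_"  # hypothetical placeholder for a not-yet-decided position


def autoregressive(target):
    """Left to right: every step commits exactly one more token."""
    out, snapshots = [], []
    for tok in target:
        out.append(tok)  # stand-in for sampling the next token from a model
        snapshots.append(" ".join(out))
    return snapshots


def iterative_unmask(target, per_step=2, seed=0):
    """Diffusion-style toy: start fully masked, fill positions in any order."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)  # positions are revealed in a random order
    snapshots = [" ".join(canvas)]
    while hidden:
        for pos in hidden[:per_step]:
            canvas[pos] = target[pos]  # stand-in for a denoising prediction
        hidden = hidden[per_step:]
        snapshots.append(" ".join(canvas))
    return snapshots


if __name__ == "__main__":
    message = "speech is a serialisation format".split()
    print("\n".join(autoregressive(message)))
    print("\n".join(iterative_unmask(message)))
```

The first function can only ever extend the prefix; the second works on the whole message at every step, which is the "whole message at once, progressively" idea from the comment above.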
Don't worry, this is just humanity being too far up their own arse and conflating the map with the territory. Speech is a serialisation format, not the foundation of thought. Thus I think that any speech-first approach is inherently misguided. Speech must be a side effect.
I think that can only happen to empire cultures: they only learn one language, and suddenly people think that's all there is. I speak five languages, my wife seven. Language synthesis is a feature, not the entire product, in my experience. Btw, this is only the third best language I can speak/write in. I didn't use AI, autocomplete, spell check, or a dictionary to construct any of my posts. All typos and imperfect grammar are perfectly organically sourced.
Edit: I just remembered, don't we have tons of research suggesting that at least birds, whales and apes/monkeys use words and simple syntax? Didn't we teach a few gorillas sign language/symbols?
As a counterargument, we should look into studies of children deprived of language until later in life. I have a dim memory of reading that one of these people never mastered complex language constructs. It could be that language and other cultural artifacts provide an "operating system" of sorts for the brain that allows higher-level thinking. (Conjecture here by a complete layman.)
Weird thing to brag about here, assuming it's even true. Furthermore, the "empire cultures" thing is clearly false since most researchers and other professionals in this field speak at least two or three languages. This is a global endeavor, not some pet project of a single language or culture.
And the power of "language models" (or any sort of deep learning, really) does not come from assuming that some specific input-output modality, like English text, is the ultimate foundation of thought. Strong versions of this claim were laid to rest around the time GPT-2 came out. I'd go further and argue that many people working on the symbolic AI of yesteryear already understood this as well.