I would add that the above comparison is misleading, because humans have a massive advantage: they already know what the words mean. A more apples-to-apples comparison would have the human do next-word prediction in a language they don't know.
This would be akin to me giving you a few GB of Chinese text, with no grounding or translation, and then trying to communicate with you in Chinese after you've read the whole thing.
The human being, who instructed the computer to use it to do the language modeling? What does attention have to do, exclusively, with language modeling?