In this benchmark, people or models are given a passage of text and then asked a number of questions about it. The questions are quite realistic. See for example here: https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/S...
Models already perform as well as humans on it. This is real, not hype.