It's a language model. It assigns probabilities to tokens in a sequence. You give it a number of options and it responds with the one that it assigns the highest probability to. If there's nothing in the options you give it that makes sense in the context of your test phrase, then it will return something that doesn't make sense. If some of your options make sense, it might return something that makes sense, or not.
So if you put it in a situation where nothing it outputs makes sense (to you) then none of its output will make sense. But that's not fair to the poor model.
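To make the selection step concrete, here is a minimal sketch. The log-probabilities are made-up numbers standing in for whatever a real model would assign, and `pick_option` is a hypothetical helper, not any library's API. The point is that picking the highest-scoring option always returns something, even when every option is nonsense.

```python
import math

# Hypothetical log-probabilities for each option given the prompt.
# In practice these would come from a language model; here they are invented.
sensible_case = {
    "a fish": math.log(1e-9),
    "a fork": math.log(1e-4),
}

# Every option here is nonsense, so all scores are tiny.
nonsense_case = {
    "a duck": math.log(1e-10),
    "a goat": math.log(2e-10),
}

def pick_option(option_logprobs):
    """Return the option with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)

print(pick_option(sensible_case))  # the sensible option wins
print(pick_option(nonsense_case))  # still returns an answer, just a nonsensical one
```

Nothing in the argmax knows whether the winning score is "good" in any absolute sense; it only knows that it beat the alternatives.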
It would be nice if it looked at the values of the probabilities and said "I don't understand the question" if the numbers are too low. Or for fun, it could point out how stupid the question was.
Yes, this is an important challenge. There is a lot of interest in it in the NLP community right now, particularly around QA tasks [1]. Standard supervised models do it well, but zero-shot models still have trouble.
It would be nice, but it's hard to know what probability is "too low". In short, the probability a model assigns to a sequence of tokens can be arbitrarily low. Some things are very unlikely to be said, but not impossible... and we still want a language model to assign them some non-zero probability. So it's very difficult to choose a threshold that doesn't risk excluding a large part of the sequences the language model recognises.
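A rough illustration of why a fixed cutoff is awkward, assuming the sequence probability is the product of per-token probabilities (the usual autoregressive factorisation). The per-token value of 0.5 and the `1e-6` threshold are arbitrary numbers chosen for the example, and both function names are hypothetical.

```python
import math

def sequence_logprob(token_probs):
    """Log-probability of a sequence: sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

def passes_threshold(token_probs, threshold=1e-6):
    """True if the whole-sequence probability is at least `threshold`."""
    return sequence_logprob(token_probs) >= math.log(threshold)

# Assume the model is equally confident (p = 0.5) about every token.
short_seq = [0.5] * 10  # 0.5**10 is roughly 1e-3
long_seq = [0.5] * 40   # 0.5**40 is roughly 1e-12

print(passes_threshold(short_seq))  # True
print(passes_threshold(long_seq))   # False
```

Both sequences are equally "plausible" token by token, yet the longer one falls below the cutoff simply because it is longer. Any single threshold on the raw probability ends up rejecting a lot of perfectly ordinary text.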
"What should I use to whisk a bowl of eggs? A fish or a fork?"
"A fork"
Repeat with "...A spoon or a duck?", "A chopstick or a goat?", "A cat or an electric whisk?"