That sort of test isn't super reliable either, in my experience.
You're probably better off asking something like "what are the most notable changes in version X of NumPy?" and repeating with successively newer versions until you hit the one where it says "I don't know" or starts hallucinating.
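If you wanted to automate that, a rough sketch might look like the following. Everything here is an assumption for illustration: `ask_model` is a hypothetical callable standing in for whatever client you use to query the model, and the `UNSURE_MARKERS` phrases are guesses that would need tuning. It only catches explicit "I don't know" style answers; spotting hallucinations still means checking the answers against the real release notes by hand.

```python
from typing import Callable, Iterable, Optional

# Hypothetical phrases signalling an admitted knowledge gap; tune per model.
UNSURE_MARKERS = ("i don't know", "i'm not sure", "not aware of", "no information")


def first_unknown_version(
    ask_model: Callable[[str], str],  # hypothetical: takes a prompt, returns the reply text
    versions: Iterable[str],          # versions to probe, oldest first
    library: str = "NumPy",
) -> Optional[str]:
    """Return the first version whose answer admits ignorance, or None."""
    for version in versions:
        prompt = (
            f"What are the most notable changes in version {version} of {library}?"
        )
        answer = ask_model(prompt).lower()
        # Plain substring check; this does NOT detect confident hallucinations.
        if any(marker in answer for marker in UNSURE_MARKERS):
            return version
    return None
```

You'd call it with something like `first_unknown_version(my_client, ["1.26", "2.0", "2.1", "2.2"])`, where `my_client` is your own wrapper, and then read the answers for the versions just before the returned one to see where the details start going wrong.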