I have a universal benchmark for judging how much knowledge a language model stores: asking it about the G-FOLD paper (https://www.lpi.usra.edu/meetings/marsconcepts2012/pdf/4193....). I noticed that GPT-3.5 hallucinates when asked about it, whereas GPT-4 can provide a high-level overview.