by munchler 799 days ago
Not so fast. If you were evaluating the model on its ability to predict the next word in a Harry Potter book, you'd be right, because it's already seen the entire book, but that's not what's happening here.

The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.

4 comments

The training data may not be HP itself. It may be millions of pages summarising/discussing/dissecting HP, which already contain the relationships spelled out better than in the book itself.
That's true, but the model still analyzed all that disparate information and produced a very detailed graph of the relevant relationships. If anyone can show that the graph itself was in the training data, then I would agree that it's not a good test.
> disparate information

I wouldn't call it disparate when there's about a dozen wikis each spelling it out like this: https://harrypotter.fandom.com/wiki/Severus_Snape

I'll eat my hat if multiple graphs almost exactly like this one weren't in the training data. This is like fandoms 101.
The frustrating thing about all this speculation is that we don't know what was in the training data, and I think we need to know that to have any meaningful discussion about it.
We should. However, in this case, isn't it a bit of a stretch to assume they didn't put just about everything in the training data?
It would have been fairly trivial to A/B test this, with the other arm asking the same question but without all the books in the context window.
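For what it's worth, the A/B comparison described above could be sketched roughly like this. Everything here is an assumption for illustration: `ask` is a hypothetical callable that wraps whatever model you're testing and returns a set of relationship edges, `reference` is a hand-built answer key, and edge-level F1 is just one plausible way to score a graph.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical edge format: (character_a, relation, character_b)
Edge = Tuple[str, str, str]


def edge_f1(predicted: Iterable[Edge], reference: Iterable[Edge]) -> float:
    """F1 overlap between a predicted edge set and a reference edge set."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def ab_compare(
    ask: Callable[[str], Set[Edge]],
    question: str,
    books: str,
    reference: Set[Edge],
) -> Dict[str, float]:
    """Ask the same question with and without the books in the prompt."""
    with_books = edge_f1(ask(books + "\n\n" + question), reference)
    without_books = edge_f1(ask(question), reference)
    return {
        "with_books": with_books,
        "without_books": without_books,
        "gain": with_books - without_books,
    }


# Stand-in for a model that answers purely from memory (ignores its prompt).
REFERENCE: Set[Edge] = {
    ("Harry", "friend", "Ron"),
    ("Harry", "friend", "Hermione"),
}


def memory_only_model(prompt: str) -> Set[Edge]:
    return set(REFERENCE)


result = ab_compare(memory_only_model, "Graph the characters.", "<book text>", REFERENCE)
```

With the memory-only stub the gain is zero, which is the outcome that would suggest the context window isn't doing the work. A real run would need several samples per arm, since a single answer from either side is noisy.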
It's a novel question, and it's impressive that Gemini was able to solve it, but the tweet's author is claiming that this is because of the large context window and not because of all the Harry Potter-related training data that was available to it.
The generic models definitely know a lot about Harry Potter without any additional context.

Probably 80% of my questions to ChatGPT were about Harry Potter plot and character details as my kid was reading the books. It was extremely knowledgeable about all the minutiae, probably thanks to all the online discussion more than the books themselves. It was actually the first LLM killer app for me.

That's a good point. I would describe this as a test of Gemini's ability to re-read something it's already familiar with, not a valid test of its ability to read a large new corpus.
It could have been trained on this exact picture created by a fan and uploaded to some forum. Ultimately it is impossible to know unless testing with brand new material.

I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?

You could modify the source material (change a name or character relationship) and see if it correctly reports the modification in the graph.
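A rough sketch of that modification check, under stated assumptions: `extract_edges` is a hypothetical callable standing in for "give the model this text and parse the graph it returns", and both stub models below are toys written only to show the two outcomes. The made-up name "Zorvan" is chosen so it can't plausibly appear in any training data.

```python
import re
from typing import Callable, Set, Tuple

# Hypothetical edge format: (character_a, character_b)
Edge = Tuple[str, str]


def swap_name(text: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of a character name in the source text."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


def reads_context(
    extract_edges: Callable[[str], Set[Edge]],
    text: str,
    old: str,
    new: str,
) -> bool:
    """True if the graph built from the modified text uses the new name only."""
    edges = extract_edges(swap_name(text, old, new))
    names = {name for edge in edges for name in edge}
    return new in names and old not in names


TEXT = "Harry befriends Ron. Harry befriends Hermione."


def memorizing_model(text: str) -> Set[Edge]:
    # Toy stub: ignores its input and answers from "training data".
    return {("Harry", "Ron"), ("Harry", "Hermione")}


def reading_model(text: str) -> Set[Edge]:
    # Toy stub: actually reads the text it is given.
    return set(re.findall(r"(\w+) befriends (\w+)", text))


memorized = reads_context(memorizing_model, TEXT, "Ron", "Zorvan")
read = reads_context(reading_model, TEXT, "Ron", "Zorvan")
```

The memorizing stub fails the check and the reading stub passes it. Against a real model you'd want to change a relationship as well as a name, since a plain name swap is easy to pattern-match without much comprehension.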