| HN Mirror



	by rryan 802 days ago
	ML 101: Do not evaluate on the training data. Yes, of course it can, because they fit in the context window. But this is an awful test of the model's capabilities because it was certainly trained on these books and on websites talking about the books and the HP universe.

4 comments

barfbagginus 802 days ago

Given that it is pretrained on the material, it would be interesting to do a differential test on in-context reinforcement. What is the recall % before reading the books and after?

I know, for instance, that GPT-4 does much better with the Python manual when we quote relevant context, even though it was trained on the Python manual. This suggests recall from pretraining alone is less than perfect.

Likewise, in the Harry Potter case I expect a significant difference between its background knowledge and the context-enhanced trial. But I don't have intuition about the effect size we should expect! That makes it a fun experiment.
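
A minimal sketch of that differential test, assuming a hypothetical `ask(prompt)` callable that wraps whichever model is being tested and a small hand-written list of question/answer pairs. None of these names come from a real library, and scoring by substring match is only a crude stand-in for proper grading.

```python
from typing import Callable, Sequence, Tuple

QA = Sequence[Tuple[str, str]]  # (question, expected answer) pairs


def recall(ask: Callable[[str], str], qa: QA, context: str = "") -> float:
    """Fraction of questions whose expected answer appears in the reply.

    `ask` is a hypothetical wrapper around the model under test.
    If `context` is given, it is prepended to every question.
    """
    hits = 0
    for question, expected in qa:
        prompt = f"{context}\n\n{question}" if context else question
        if expected.lower() in ask(prompt).lower():
            hits += 1
    return hits / len(qa)


def context_effect(ask: Callable[[str], str], qa: QA, context: str):
    """Return (recall without context, recall with context, difference)."""
    before = recall(ask, qa)
    after = recall(ask, qa, context)
    return before, after, after - before
```

The difference returned by `context_effect` is the effect size in question: near zero would mean the context window added little beyond what pretraining already supplied.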

munchler 802 days ago

Not so fast. If you were evaluating the model on its ability to predict the next word in a Harry Potter book, you'd be right, because it's already seen the entire book, but that's not what's happening here.

The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.

viraptor 802 days ago

The training data may not be HP itself. It may be millions of pages summarising/discussing/dissecting HP, which already contain the relationships spelled out better than in the book itself.

munchler 802 days ago

That's true, but the model still analyzed all that disparate information and produced a very detailed graph of the relevant relationships. If anyone can show that the graph itself was in the training data, then I would agree that it's not a good test.

chmod775 801 days ago

> disparate information

I wouldn't call it disparate when there's about a dozen wikis each spelling it out like this: https://harrypotter.fandom.com/wiki/Severus_Snape

vsnf 802 days ago

I'll eat my hat if multiple graphs almost exactly like this one weren't in the training data. This is like fandoms 101.

lukan 801 days ago

The frustrating thing about all this speculation is that we don't know what was in the training data, and I think we need to know that to have any meaningful discussion about it.

actionfromafar 801 days ago

We should. However, in this case, isn't it a bit of a stretch to assume they didn't put just about everything in the training data?

joshspankit 801 days ago

It would have been fairly trivial to A/B test this, with the other arm asking the same question but without all the books in the context window.

xmprt 802 days ago

It's a novel question, and it's impressive that Gemini was able to solve it, but the tweet's author is claiming that this is because of the large context window and not because of all the Harry Potter related training data that was available to it.

aikinai 801 days ago

The generic models definitely know a lot about Harry Potter without any additional context.

Probably 80% of my questions to ChatGPT were about Harry Potter plot and character details as my kid was reading the books. It was extremely knowledgeable about all the minutiae, probably thanks to all the online discussion more than the books themselves. It was actually the first LLM killer app for me.

munchler 802 days ago

That's a good point. I would describe this as a test of Gemini's ability to re-read something it's already familiar with, not a valid test of its ability to read a large new corpus.

paxys 802 days ago

It could have been trained on this exact picture created by a fan and uploaded to some forum. Ultimately it is impossible to know unless testing with brand new material.

I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?

poglet 801 days ago

You could modify the source material (change a name or character relationship) and see if it correctly reports the modification in the graph.
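
A rough sketch of that check, under the assumption that the modification is a simple character rename and that the model's graph comes back as plain text. The helper names here are made up for illustration, and the check is deliberately naive: it breaks if the new name contains the old one.

```python
import re


def rename_entity(text: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old` with `new` in the source text."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


def reflects_modification(answer: str, old: str, new: str) -> bool:
    """True if the answer mentions the new name and never the old one.

    An answer that still uses the old name is probably leaning on
    training data rather than on the modified text in the context.
    """
    lower = answer.lower()
    return new.lower() in lower and old.lower() not in lower
```

You'd run `rename_entity` over the books before putting them in the prompt, then pass the model's graph output to `reflects_modification`.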

viraptor 802 days ago

It seems from the replies that he tried it without the context too and didn't get as detailed answers. I'd really like to see the actual difference, but yeah, it would be so much more interesting to use books which aren't summarised and discussed all over the internet.

magospietato 801 days ago

I got some interesting results by feeding Claude 3 a very sparse primer for a conlang I wrote when I was 18.

There is zero chance this is anywhere in the model's dataset, and we were able to perform basic translation to and from English.