| HN Mirror


by haldujai 1017 days ago

> data you've only seen once

Is this still true given that they're upsampling in the pretraining dataset? I don't recall any details in the Llama2 paper on how, or to what extent, they did this, but presumably some fraction of those 2T training tokens is repeated data.

MetaAI hasn't been as averse to repeated tokens as other groups: they trained the now-forgotten Galactica for multiple epochs with good results.

> The validation curves would be considerably more convincing.

What are they validating on? I was under the impression they weren't splitting the pretraining corpus.

1 comment

stephenroller 1017 days ago

The Llama1 team did not have a validation set. I don’t know what the Llama2 team did; I left before seeing any of the details.

My guess is Llama2 upsamples Wikipedia a good bit, but given they didn’t report any information about training data, it’s hard to say.
