by alexedw 1021 days ago
This is silly. Look at the loss and benchmark curves for the Pythia suite of models - the smaller models certainly did saturate and in fact began worsening.

2T tokens not saturating a 7B model is very different from 3T tokens on a 1B model.
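
To put rough numbers on that gap, here is a quick back-of-the-envelope sketch. It assumes "T" means trillions of training tokens and "B" means billions of parameters; the helper name `tokens_per_param` is made up for illustration.

```python
# Rough tokens-per-parameter comparison for the two setups mentioned above.
# Assumption: "2T"/"3T" are training tokens, "7B"/"1B" are parameter counts.

def tokens_per_param(tokens: float, params: float) -> float:
    """Return how many training tokens each parameter sees on average."""
    return tokens / params

ratio_7b = tokens_per_param(2e12, 7e9)  # 2T tokens on a 7B model
ratio_1b = tokens_per_param(3e12, 1e9)  # 3T tokens on a 1B model

print(f"7B model: {ratio_7b:.0f} tokens per parameter")  # about 286
print(f"1B model: {ratio_1b:.0f} tokens per parameter")  # 3000
print(f"The 1B run is {ratio_1b / ratio_7b:.1f}x further along this axis")  # 10.5x
```

So the 1B run sees roughly ten times more tokens per parameter, which is why one result not saturating says little about the other.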

1 comment

That's the point of the experiment actually…