by donsupreme
781 days ago
> We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data. This isn't training on top of existing weights from Llama-3, it's training using their own long context data, and it's such a tiny set that I'm wondering how strong its reasoning capability is.
Re: token count: our copy was wrong. It was pre-prepped copy for a model run that didn't pan out. We're updating it to the correct number, which is already present in the training grid further down in the model card: a bit over 830M tokens for this stage and >1B for all extension stages combined.
Your point about token counts still stands. We wanted to get something out ASAP and fine-tune more later. I believe Llama 3's giant vocab size is actually adversarial for fine-tunes: you need a beefy dataset to even hit every vocab token a single time with a forward and backward pass.
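For anyone who wants to sanity-check the vocab point on their own data, here's a rough sketch of measuring how much of a vocabulary a token stream actually touches. The function name `vocab_coverage` and the toy stream are made up for illustration. In practice `token_ids` would come from running the tokenizer over your fine-tuning set, and `vocab_size` would be the tokenizer's vocabulary size (roughly 128K for Llama 3).

```python
from collections import Counter
from typing import Iterable


def vocab_coverage(token_ids: Iterable[int], vocab_size: int) -> tuple[float, int]:
    """Return (fraction of vocab ids seen at least once, number of ids never seen)."""
    # Count only ids that fall inside the vocabulary range.
    counts = Counter(t for t in token_ids if 0 <= t < vocab_size)
    seen = len(counts)
    return seen / vocab_size, vocab_size - seen


# Toy stream over a hypothetical 10-token vocabulary; ids 3, 6, 7, 8, 9 never occur.
toy_stream = [0, 1, 1, 2, 0, 4, 5, 1, 0, 2]
fraction, unseen = vocab_coverage(toy_stream, vocab_size=10)
print(f"coverage={fraction:.0%}, unseen={unseen}")  # coverage=50%, unseen=5
```

Any id that never shows up in the stream gets no direct gradient signal for its embedding row from that data, which is the concern with a small set and a large vocab.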