by donsupreme 781 days ago
> We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.

This isn't training on top of existing Llama-3 weights; it's training on their own long-context data, and it's such a tiny set that I wonder how strong its reasoning capability is.


We are training on top of llama 3. The 256k reasoning benchmarks are on the open LLM leaderboard.

And re: token count: our copy was wrong -- it's pre-prepped copy for a model run that didn't pan out. Updating to correct number -- already present in the training grid further down in the model card. Bit over 830M tokens for this stage and >1B for all extension stages combined.

Your point re: token counts still stands. We wanted to get something out ASAP and finetune more later. I believe the giant vocab size of Llama 3 is actually adversarial for finetunes. You need a beefy dataset to even hit every vocab token a single time with a forward and backward pass.
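To make the vocab-coverage point concrete, here is a rough back-of-the-envelope sketch. It assumes token frequencies follow a Zipf distribution, which is an illustrative assumption and not something measured on our data, and estimates what fraction of a vocabulary is expected to appear at least once in a dataset of a given size. The function name and the `zipf_exponent` parameter are made up for this example.

```python
import math


def expected_vocab_coverage(vocab_size, num_tokens, zipf_exponent=1.0):
    """Expected fraction of vocab entries seen at least once in num_tokens draws.

    Assumption (hypothetical, for illustration only): the token at frequency
    rank r has probability proportional to 1 / r**zipf_exponent, and tokens
    are drawn independently.
    """
    weights = [1.0 / (rank ** zipf_exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    expected_seen = 0.0
    for w in weights:
        p = w / total
        if p >= 1.0:
            # Degenerate single-token vocabulary: always seen.
            expected_seen += 1.0
        else:
            # P(token appears at least once) = 1 - (1 - p)**num_tokens,
            # computed via log1p/expm1 to stay accurate for tiny p.
            expected_seen += -math.expm1(num_tokens * math.log1p(-p))
    return expected_seen / vocab_size


if __name__ == "__main__":
    # 128,256 is the Llama 3 tokenizer's vocabulary size.
    for n in (1_000_000, 100_000_000, 1_000_000_000):
        print(n, expected_vocab_coverage(128_256, n))
```

Real token distributions have a much heavier rare tail than a clean Zipf curve (and some entries may essentially never occur in a given domain), so treat this as an optimistic estimate. Seeing a token once is also far from enough gradient signal to update its embedding meaningfully.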

The table at the bottom says they initialized the 65K version from "LLaMA-3 7B"? (Assuming the 7B is a typo and they meant 8B.)

And each successive version with a larger window was initialized from the previous smaller one (65K -> 262K -> 524K -> 1048K).

Right. We are sleep deprived -- couldn't stop over the weekend. Please forgive the typos.