| The author of the article appears to have misunderstood one important detail about Code Llama. They state:

> The Code Llama models were trained on 500B tokens, whereas Llama 2 models were trained on 2T tokens. Since the Code Llama model was trained on 4x fewer tokens, maybe a CodeLlama 70B version did not perform well enough due to LLM scaling laws—there was not enough training data.

But if you read the paper, on page 1, it says:

> Our approach is based on gradually specializing and increasing the capabilities of Llama 2 models by applying a cascade of training and fine-tuning steps [...]

In fact, they show a diagram at the top of page 3 that details the process, starting with Llama 2 foundation models:

Llama 2 Foundation models (7B, 13B, 34B) -> Code training 500B -> Python / Long Context.

See the paper here: https://arxiv.org/abs/2308.12950 |
What I meant to say here was 500B domain-specific tokens. Maybe "domain-specific" is not the right term, but I mean tokens related to the problems that the LLM aims to solve.
EDIT: Updated the text to be more clear.
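
For anyone following the numbers, here is a rough sketch of the arithmetic being discussed. It uses only the two token counts quoted in this thread (2T for Llama 2, 500B for the code-training stage); the stage names and the `cumulative_tokens` helper are made up for illustration, and the later Python / Long Context steps are left out because their token counts are not given here.

```python
# Token counts as quoted in the thread; the stage names are illustrative labels.
LLAMA2_PRETRAIN_TOKENS = 2_000_000_000_000  # 2T, Llama 2 foundation model
CODE_TRAINING_TOKENS = 500_000_000_000      # 500B, code-training stage on top


def cumulative_tokens(stages):
    """Return (stage name, running token total) after each training stage."""
    totals = []
    running = 0
    for name, tokens in stages:
        running += tokens
        totals.append((name, running))
    return totals


stages = [
    ("llama2_pretraining", LLAMA2_PRETRAIN_TOKENS),
    ("code_training", CODE_TRAINING_TOKENS),
]

for name, total in cumulative_tokens(stages):
    print(f"after {name}: {total / 1e12:.1f}T tokens")

# The "4x fewer" figure compares the code-training stage alone to Llama 2 pretraining.
print(LLAMA2_PRETRAIN_TOKENS // CODE_TRAINING_TOKENS)
```

Under these assumptions the model has seen about 2.5T tokens in total after code training, of which 500B are the code-focused ones. The 4x ratio only holds if the 500B is read as the whole training budget rather than an additional stage.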