by joennlae
979 days ago
Llama1 --> 1.0T
Llama2 --> 2.0T
Mistral --> ?? They do not publish how many tokens it was pre-trained on, and they share no information on the datasets used (except for fine-tuning).
To my knowledge, no one has trained a larger LLM (>250M parameters) to the capacity limit.
This is discussed in the original GPT3 paper (https://twitter.com/gneubig/status/1286731711150280705?s=20).
TinyLlama is trying to do that for 1.1B: https://github.com/jzhang38/TinyLlama
As long as we are not at the capacity limit, we will keep seeing a few of these "7B beats 13B" (or "7B beats 70B") moments.