by joennlae 979 days ago
Pre-training tokens: Llama1 --> 1.0T, Llama2 --> 2.0T, Mistral --> ??

They do not publish how many tokens it was pre-trained on, and they share no information about the datasets used (except for fine-tuning).

To my knowledge, no one has trained a larger LLM (>250M parameters) to the capacity limit, as discussed in the original GPT-3 paper (https://twitter.com/gneubig/status/1286731711150280705?s=20).

TinyLlama is trying to do that for 1.1B: https://github.com/jzhang38/TinyLlama

As long as we are not at the capacity limit, we will keep seeing a few of these "7B beats 13B" (or "7B beats 70B") moments.
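
To put the token counts above in perspective, here is a rough sketch of tokens seen per parameter. It assumes the 7B variants and rounds the parameter count to 7e9; the dictionary and function names are mine, purely for illustration. Mistral is left out because its token count is not published.

```python
# Tokens seen per parameter during pre-training, using the figures quoted above.
# Assumption: 7B variants, with the parameter count rounded to 7e9.
PRETRAINING_RUNS = {
    "Llama1-7B": (7e9, 1.0e12),  # (parameters, pre-training tokens)
    "Llama2-7B": (7e9, 2.0e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return tokens / params


for name, (params, tokens) in PRETRAINING_RUNS.items():
    ratio = tokens_per_parameter(params, tokens)
    print(f"{name}: ~{ratio:.0f} tokens per parameter")
```

Same model size, twice the tokens, twice the ratio. That says nothing about where the capacity limit actually sits, only that the two runs stopped at very different points.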