by antimatter15
1176 days ago
Looking at their charts, it seems their 6.7B model is considerably worse than GPT-J, an existing open 6B model from several years ago. I wish that, rather than stopping training early, they had run more data through a small model so we could have something more competitive with LLaMA 7B.

"We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily 'best for use'. If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license."
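
For a rough sense of scale, here is a back-of-the-envelope sketch of what "20 tokens per param" works out to for the 6.7B model. The function names are made up for illustration, and the 6 * N * D training-FLOPs rule is a common approximation that I am assuming here; it is not something stated in the quote.

```python
# Back-of-the-envelope arithmetic only; not the vendor's actual training setup.

TOKENS_PER_PARAM = 20  # ratio stated in the quoted reply


def compute_optimal_tokens(params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Training tokens implied by a fixed tokens-per-parameter ratio."""
    return params * ratio


def approx_train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs via the common 6 * N * D rule (an assumption)."""
    return 6 * params * tokens


params = 6.7e9  # the 6.7B model discussed above
tokens = compute_optimal_tokens(params)
flops = approx_train_flops(params, tokens)

print(f"tokens: {tokens / 1e9:.0f}B")
print(f"approx. training FLOPs: {flops:.2e}")
```

So at that ratio the 6.7B model sees about 134B tokens. With a fixed parameter budget you would keep `params` the same and raise the ratio instead, which is the "train on more tokens" option the quote describes.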