by cs-fan-101
1176 days ago
Someone shared this from the Cerebras Discord earlier, but reposting for visibility: "We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license."
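For anyone who wants to see what "20 tokens per param" works out to, here is a rough back-of-the-envelope sketch. The 20:1 ratio is from the quote. The parameter counts in the loop are illustrative only, and the `6 * N * D` training-FLOPs estimate is a common rule of thumb that I am assuming here, not something the quote states.

```python
TOKENS_PER_PARAM = 20  # ratio from the quote above


def compute_optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Training tokens implied by a fixed tokens-per-parameter ratio."""
    return n_params * tokens_per_param


def approx_train_flops(n_params: int, n_tokens: int) -> int:
    """Rough training cost. Assumption: the ~6 * N * D rule of thumb, not from the quote."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    # Illustrative (hypothetical) model sizes, not a list of any released models.
    for n in (1_000_000_000, 7_000_000_000, 13_000_000_000):
        d = compute_optimal_tokens(n)
        flops = approx_train_flops(n, d)
        print(f"{n / 1e9:.0f}B params: {d / 1e9:.0f}B tokens, ~{flops:.2e} FLOPs")
```

The fixed-parameter-budget case in the quote corresponds to holding `n_params` constant and passing a larger `tokens_per_param`, which costs more compute in exchange for a better model at that size.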
I'd chip in!