by antimatter15 1176 days ago
Looking at their charts, it seems like their 6.7B model is considerably worse than GPT-J, an existing open 6B model from several years ago.

I wish that, rather than stopping training early, they had run more data through a small model so we could have something more competitive with LLaMA 7B.


Someone posted this from the Cerebras Discord earlier, but I'm sharing it again for visibility:

"We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license"

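To put rough numbers on the quote, here is a back-of-the-envelope sketch. The 20 tokens per parameter ratio is from the quote; the 6 * params * tokens FLOPs rule of thumb and the roughly 1T-token figure for LLaMA 7B are my own assumptions, not something Cerebras stated, and the function names are just illustrative.

```python
# Rough arithmetic for the "20 tokens per param" rule in the quote above.
# Assumption (not from Cerebras): training compute ~ 6 * params * tokens FLOPs,
# a common rule of thumb for dense transformers.

TOKENS_PER_PARAM = 20  # ratio stated in the quote


def compute_optimal_tokens(params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Tokens to train on for a given parameter count at a fixed tokens/param ratio."""
    return params * ratio


def approx_train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs using the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


params = 6.7e9                      # the 6.7B model discussed in the thread
tokens_20x = compute_optimal_tokens(params)
tokens_long = 1.0e12                # assumption: roughly what LLaMA 7B reportedly saw

print(f"20 tokens/param: {tokens_20x / 1e9:.0f}B tokens, "
      f"~{approx_train_flops(params, tokens_20x):.2e} FLOPs")
print(f"1T-token run:    {tokens_long / 1e9:.0f}B tokens, "
      f"~{approx_train_flops(params, tokens_long):.2e} FLOPs")
print(f"extra data/compute for the longer run: {tokens_long / tokens_20x:.1f}x")
```

Under those assumptions, the 6.7B model at 20 tokens per parameter sees about 134B tokens, while a 1T-token run on the same parameter count needs roughly 7.5x more data and compute. That gap is the difference between "optimal for a fixed compute budget" and "best for a fixed parameter budget".
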
Sounds like we should crowd-fund the cost to train and open source one of these models with LLaMA-like quality.

I'd chip in!

TBH that seems like a good job for Cerebras.

There are plenty of such efforts, but the organizer needs some kind of standing to attract a critical mass, and an AI ASIC designer seems like a good candidate.

Then again, maybe they prefer a bunch of privately trained models over an open one since that sells more ASIC time?

> Cerebras Discord

This is really weird to hear out loud.

I still think of Discord as a niche gaming chatroom, even though I know that (for instance) a wafer-scale IC design company is hosting a Discord now.