by mccorrinall 1257 days ago
It’s easier to train on all GitHub repos if you own GitHub. There is no real alternative to Codex.

People are trying to recreate models like GPT-3, but even the EleutherAI folks only collected around 800GB of training data, which is much less than what OpenAI has (iirc around 45TB). And apparently their data is very high quality. EleutherAI is pretty much one of the few big open source competitors, with models like GPT-NeoX.

Plus OpenAI has great branding.

1 comment

Interesting, thanks for sharing!

I wonder how LaMDA compares to ChatGPT performance-wise. I definitely understand why training on GitHub is an advantage, but I'd expect Google to also be great at getting a good dataset across the range of things they'd be interested in.