by visarga
2174 days ago
I think we're far from having used all the media on the internet to train a model. GPT-3 used about 570GB of text (about 50M articles). ImageNet is just 1.5M photos. It's still expensive to ingest all of YouTube, Google Search and Google Photos into a single model. And the nice thing about these large models is that you can reuse them, with little fine-tuning, for all sorts of other tasks. So the industry and any hacker can benefit from these uber-models without having to retrain from scratch. That is, of course, if the models even fit on the available hardware; otherwise they have to make do with slightly lower performance.

And certainly fine-tuning and distillation are part of the story of why we wanted these large do-all-be-all models in the first place. But the question of what's next for the state of the art (which currently would be featurization through a large transformer model, e.g. BERT, ERNIE, GPT-2, with some deep-but-not-huge task-specific model on top) isn't simply answered by "more compute".