Hacker News
by sqrt17 2173 days ago
Here's a thing: a model with incorrect assumptions built in is more harmful than a model that assumes too little structure. If you model the vocal tract and the actually interesting signal is in the transient noises we make when producing consonants, then at best you've done lots of work with not much to show for it, and at worst you've constrained your model in a way that hurts. That's the basis for the "every time we fired a linguist, recognition rates improved" line from 90s speech recognition.

At the other end of the spectrum, data and compute ARE limited. For some tasks we're at the point where the model eats up all of humanity's written works plus a couple million dollars in compute, and further progress has to come from elsewhere: even large companies won't spend billions of dollars on compute, and humanity will not suddenly write ten times more blog articles.

1 comments

I think we're far from having used all the media on the internet to train a model. GPT-3 used about 570GB of text (about 50M articles). ImageNet is just 1.5M photos. It's still expensive to ingest all of YouTube, Google Search and Google Photos into a single model.

And the nice thing about these large models is that you can reuse them with little fine-tuning for all sorts of other tasks. So the industry and any hacker can benefit from these uber-models without having to retrain from scratch. That is, of course, if the models even fit on the available hardware; otherwise they have to make do with slightly lower performance.

GPT-3 is too large to be useful for practical purposes. Look it up. It's the equivalent of a Formula 1 car or a Saturn V rocket - an impressive feat of technology but of no practical relevance for getting you to work and back.

And certainly fine-tuning and distillation are part of the story of why we wanted these large do-all-be-all models in the first place. But the question of what's next for the state of the art (which currently would be featurization through a large transformer model, e.g. BERT, ERNIE, GPT-2, with some deep-but-not-huge task-specific model on top) isn't simply answered by "more compute".
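
To make the "frozen featurizer plus small task-specific model on top" setup concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `frozen_features` is a toy hashed bag-of-words standing in for a pretrained transformer (it is not BERT or GPT-2 and has none of their quality), the head is plain logistic regression rather than a deep model, and the four training sentences are made up. The only point is the division of labour: the featurizer is never updated, and only the cheap head is trained.

```python
import math
import random
import zlib

DIM = 2 ** 16  # number of hash buckets; arbitrary choice for this sketch


def frozen_features(text):
    """Toy stand-in for a pretrained encoder (an assumption, not a real model).

    Returns a sparse, L2-normalised hashed bag of words. Nothing here is trained.
    """
    counts = {}
    for token in text.lower().split():
        idx = zlib.crc32(token.encode("utf-8")) % DIM
        counts[idx] = counts.get(idx, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {i: v / norm for i, v in counts.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(head, x):
    """Linear score of sparse feature vector x under head = (weights, bias)."""
    w, b = head
    return sum(w.get(i, 0.0) * v for i, v in x.items()) + b


def train_head(examples, epochs=200, lr=0.5, seed=0):
    """Fit a logistic-regression head with SGD; the featurizer stays frozen."""
    rng = random.Random(seed)
    # Features are computed once up front, since the encoder never changes.
    feats = [(frozen_features(text), label) for text, label in examples]
    w, b = {}, 0.0
    for _ in range(epochs):
        rng.shuffle(feats)
        for x, y in feats:
            grad = sigmoid(score((w, b), x)) - y  # d(log loss)/d(score)
            for i, v in x.items():
                w[i] = w.get(i, 0.0) - lr * grad * v
            b -= lr * grad
    return w, b


def predict(head, text):
    """Probability of the positive class for a piece of text."""
    return sigmoid(score(head, frozen_features(text)))


# Made-up toy data: 1 = positive, 0 = negative.
EXAMPLES = [
    ("great model works well", 1),
    ("terrible results useless", 0),
    ("works great", 1),
    ("useless and terrible", 0),
]
head = train_head(EXAMPLES)
```

In a real setup the featurizer would be the expensive pretrained part and the head the only thing you train per task, which is why "more compute" on the big model and "better task-specific model" are separate questions.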