by vlmutolo 2173 days ago
It’s funny when you’ve been thinking for months about how speech recognition could really benefit from integrating models of the human vocal tract…

and then you read this

3 comments

Here's the thing: incorrect assumptions built into a model do more harm than assuming too little structure. If you model the vocal tract and the actually interesting things are the transient noises that occur when we produce consonants, then at best you've done lots of work with not much to show for it, and at worst you're actively limiting your model. That's the basis for the "every time we fired a linguist, recognition rates improved" quip from 90s speech recognition.

On the other end of the spectrum, data and compute ARE limited. For some tasks we're at a point where the model eats up all of humanity's written works and a couple million dollars in compute, so further progress has to come from elsewhere: even large companies won't spend billions of dollars on compute, and humanity will not suddenly write ten times more blog articles.

I think we're far from having used all the media on the internet to train a model. GPT-3 used about 570GB of text (about 50M articles). ImageNet is just 1.5M photos. It's still expensive to ingest all of YouTube, Google Search and Google Photos into a single model.

And the nice thing about these large models is that you can reuse them with little fine-tuning for all sorts of other tasks. So the industry and any hacker can benefit from these uber-models without having to retrain from scratch. Of course, that only works if they fit the available hardware; otherwise you have to make do with slightly lower performance.

GPT-3 is too large to be useful for practical purposes. Look it up. It's the equivalent of a Formula 1 car or a Saturn V rocket - an impressive feat of technology but of no practical relevance for getting you to work and back.

And certainly fine-tuning and distillation are part of the reason we wanted these large do-all-be-all models in the first place. But the question of what's next for the state of the art - which currently would be featurization through a large transformer model (e.g. BERT, ERNIE, GPT-2) with some deep-but-not-huge task-specific model on top - isn't simply answered by "more compute".
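
To make the "frozen featurizer plus small task-specific model" setup concrete, here's a toy sketch in plain Python. Everything in it is made up for illustration: `frozen_featurizer` is a hypothetical stand-in for a big pretrained encoder (it just hashes character trigrams, it is not BERT or GPT-2), and `train_head` is a bare logistic regression rather than a deep head. The only point is the division of labor: the featurizer is never updated, and only the small head sees gradients.

```python
import math
import zlib


def frozen_featurizer(text, dim=16):
    """Hypothetical stand-in for a large pretrained encoder.

    Hashes character trigrams into a fixed-size, L2-normalized vector.
    It has no trainable state, which mimics a frozen feature extractor.
    """
    vec = [0.0] * dim
    padded = "  " + text.lower() + " "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash() on str.
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(features, labels, epochs=200, lr=0.5):
    """Train a small logistic-regression head on top of fixed features (plain SGD)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Tiny made-up dataset: 1 = positive, 0 = negative.
texts = ["great talk", "loved it", "terrible audio", "awful demo"]
labels = [1, 1, 0, 0]

features = [frozen_featurizer(t) for t in texts]  # computed once, never retrained
w, b = train_head(features, labels)
print([predict(w, b, x) for x in features])
```

In the real setup the featurizer is the expensive part and the head is cheap, so "what's next" could mean a better featurizer, a better head, or better data for either, and this sketch doesn't pretend to say which.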

I think that your particular example is very relevant.

Of course a good speech recognition system needs to model all the relevant characteristics of the human vocal tract as such, and of the many different vocal tracts of individual humans!

But this is substantially different from the notion of integrating a human-made model of the human vocal tract.

In this case the bitter lesson (which, as far as I understand, does apply to vocal tract modeling - I don't personally work on speech recognition but colleagues a few doors down do) is this: if you start with some data about human voice and biology, develop some explicit model M, and then integrate it into your system, it does not work as well as properly designing a system that learns speech recognition as a whole and, given sufficient data, learns an implicit model M' of the relevant properties of the vocal tract (and the distribution of these properties across different vocal tracts) as a byproduct.

A hypothesis on the reason for this (which still needs more research to be demonstrated, though we have some empirical evidence for similar things in most aspects of NLP) is that the human-made model M can't be as good as the learned model because it's restricted by the need to be understandable by humans. It's simplified and regularized and limited in size so that it can be reasonably developed, described, analyzed and discussed by humans - but there's no reason to suppose that the ideal model that would perfectly match reality is simple enough for that. It may well be reducible to a parametric function that simply has too many parameters to be summarized at a human-understandable size without simplifying in ways that cost accuracy.

"Every time I fire an anatomist and hire a TPU pod, my WER halves."