

by PeterisP 2173 days ago

I think that your particular example is very relevant.

Of course a good speech recognition system needs to model all the relevant characteristics of the human vocal tract in general, and of the many different vocal tracts of individual humans!

But this is substantially different from the notion of integrating a human-made model of the human vocal tract.

In this case the bitter lesson (which, as far as I understand, does apply to vocal tract modeling; I don't personally work on speech recognition, but colleagues a few doors down do) is this: if you start with some data about the human voice and biology, develop an explicit model M, and then integrate it into your system, the result does not work as well as a properly designed system that learns speech recognition as a whole. Given sufficient data, such a system learns an implicit model M' of the relevant properties of the vocal tract (and of how those properties are distributed across different vocal tracts) as a byproduct.

One hypothesis for why this happens (it still needs more research to be demonstrated, though we have empirical evidence for similar effects in most areas of NLP) is that the human-made model M can't be as good as the learned model because it is constrained by the need to be understandable by humans. It is simplified, regularized and limited in size so that humans can reasonably develop, describe, analyze and discuss it. But there's no reason to suppose that the ideal model, one that would perfectly match reality, is simple enough for that. It may well be reducible to a parametric function that simply has too many parameters to be summarized at a human-understandable size without simplifications that cost accuracy.
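To make the capacity argument concrete, here is a toy regression sketch. It is not speech recognition and not evidence for the hypothesis; it only illustrates the mechanism. Everything in it is a hypothetical assumption: the "reality" `true_signal` (deliberately chosen to contain fine structure a compact model omits), the two-parameter "hand-designed" model standing in for M, and a 100-bin averaging regressor standing in for a learned M' that assumes no functional form.

```python
import math
import random

def true_signal(x):
    # Hypothetical "reality": a dominant component plus a finer-grained one
    return math.sin(2 * math.pi * x) + 0.3 * math.sin(2 * math.pi * 11 * x)

def make_data(n, noise, rng):
    # Uniform inputs on [0, 1) with Gaussian observation noise
    xs = [rng.random() for _ in range(n)]
    ys = [true_signal(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_simple(xs, ys):
    # Stand-in for a human-made model M: y = a*sin(2*pi*x) + b.
    # Two interpretable parameters, fitted by ordinary least squares.
    s = [math.sin(2 * math.pi * x) for x in xs]
    n = len(xs)
    ms, my = sum(s) / n, sum(ys) / n
    cov = sum((si - ms) * (yi - my) for si, yi in zip(s, ys))
    var = sum((si - ms) ** 2 for si in s)
    a = cov / var
    b = my - a * ms
    return lambda x: a * math.sin(2 * math.pi * x) + b

def fit_binned(xs, ys, bins):
    # Stand-in for a learned model M': one free parameter per bin
    # (the mean target in that bin), no assumed functional form.
    sums, counts = [0.0] * bins, [0] * bins
    for x, y in zip(xs, ys):
        i = min(int(x * bins), bins - 1)
        sums[i] += y
        counts[i] += 1
    means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(bins)]
    return lambda x: means[min(int(x * bins), bins - 1)]

def mse(model, xs):
    # Mean squared error against the noise-free signal on held-out inputs
    return sum((model(x) - true_signal(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(20000, 0.1, rng)
test_x = [rng.random() for _ in range(5000)]

simple_mse = mse(fit_simple(train_x, train_y), test_x)
learned_mse = mse(fit_binned(train_x, train_y, 100), test_x)
print(f"simple M   (2 params):   {simple_mse:.4f}")
print(f"learned M' (100 params): {learned_mse:.4f}")
```

The simple model stays stuck near the variance of the component it cannot represent (about 0.045 here), while the 100-parameter model keeps improving as data grows. That is the "too many parameters to summarize" situation in miniature. The example is stacked by construction; with real data, the open question is whether reality actually looks like this.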