by bornhuetter 5103 days ago
Can someone please explain senones to me? Can't find much on Google.

The article says that they are a fragment of a phoneme, but how small a fragment are we talking? 2-3 per phoneme, or many more?

Also - I'd be curious how much the phoneme in a word can vary based on accent.

2 comments

http://cmusphinx.sourceforge.net/wiki/tutorialconcepts

"Speech is a continuous audio stream where rather stable states mix with dynamically changed states. In this sequence of states, one can define more or less similar classes of sounds, or phones.

Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors - phone context, speaker, style of speech and so on. The so called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones - parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units - different substates of a phone. Often three or more regions of a different nature can easily be found.

The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the next part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.

Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That's why we prefer to call this object senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way."

Thanks. So senones are not just fragments of phones - two senones could sound exactly the same, but be classified differently depending on their context within the audio stream.

Senones are just tied triphone HMM states. A context-dependent HMM recognizer has a 3-5 state HMM for every context-dependent phone. Conceptually, each HMM state in each phone HMM has its own Gaussian mixture model, but this is awful in practice because many of those states get very little training data assigned to them. So people share parameters across HMM states using a data-driven decision tree that clusters states together. Those clustered, or tied, states are sometimes called senones.
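
To get a feel for how tying shrinks the state inventory, here is a rough Python sketch. Everything in it is made up for illustration: the six-phone set, the vowel/consonant question, and the `senone_id` function, which is a hand-written stand-in for the decision tree. A real recognizer learns its tree questions from training data and uses a much larger phone set.

```python
# Hypothetical toy phone set; real recognizers use a few dozen phones.
PHONES = ["a", "i", "u", "t", "s", "n"]
VOWELS = {"a", "i", "u"}
STATES_PER_PHONE = 3  # entry, stable middle, exit


def triphone_states(phones):
    """List every untied state as (left, centre, right, state_index)."""
    return [
        (left, centre, right, state)
        for left in phones
        for centre in phones
        for right in phones
        for state in range(STATES_PER_PHONE)
    ]


def senone_id(left, centre, right, state):
    """Toy stand-in for the decision tree: map one triphone state to a tied state.

    Assumption for illustration only: the first state asks about the left
    neighbour, the last state asks about the right neighbour, and the middle
    state ignores context. A real tree learns its questions from data.
    """
    if state == 0:
        context = "after-vowel" if left in VOWELS else "after-consonant"
    elif state == STATES_PER_PHONE - 1:
        context = "before-vowel" if right in VOWELS else "before-consonant"
    else:
        context = "any"
    return (centre, state, context)


states = triphone_states(PHONES)
senones = {senone_id(*s) for s in states}

print(len(states), "untied triphone states")
print(len(senones), "senones after tying")
```

With this toy set, 648 untied states collapse into 30 tied ones, and the middle state of "a" is the same senone whether it sits in t-a-s or n-a-i. So the answer to "how many per phoneme" depends on how aggressively the tree clusters, not on a fixed number.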