by tsumnia 5105 days ago
How does this compare to Microsoft's old HTK (HMM Toolkit)? The language used on the website seems to point to a lot of the same things. Is this breaking speech down to actual IPA phonemes?

I'm mostly curious because I used HTK for my thesis and would like to know how they compare (besides one just being 'newer').

2 comments

This approach still uses HMMs; it's just that the observation probabilities now come from a DNN (deep neural network) instead of a GMM (Gaussian mixture model). "Senones" are not new: HTK can use various context-dependent phoneme models, and the HMM states (typically 3) within each context-dependent phoneme essentially boil down to what they call a "senone" here. Interestingly, they use GMMs to bootstrap the DNN training, which I suppose you could avoid once you have a reasonable DNN lying around.
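To make the "observation probabilities from a DNN" part concrete, here is a minimal sketch of the usual hybrid trick: the network outputs posteriors P(senone | frame), and dividing by the senone priors gives a scaled likelihood the HMM can use in place of a GMM score. The function name, the floor value, and the toy numbers are my own assumptions, not anything taken from the system being discussed.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Turn per-frame senone posteriors P(s|x) into log P(s|x) - log P(s).

    By Bayes' rule this equals log p(x|s) up to a per-frame constant,
    which does not change which HMM path scores best.
    The floor (an arbitrary choice here) avoids log(0).
    """
    return [
        [math.log(max(p, floor)) - math.log(max(q, floor))
         for p, q in zip(frame, priors)]
        for frame in posteriors
    ]

# Toy example: one frame, three senones (made-up numbers).
frame_posteriors = [[0.7, 0.2, 0.1]]   # softmax output of the DNN
senone_priors = [0.5, 0.25, 0.25]      # relative frequency in the training alignment
scores = scaled_log_likelihoods(frame_posteriors, senone_priors)
```

A senone that the DNN rates above its prior gets a positive score, one rated below its prior gets a negative score.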

The main difference here is hooking the DNN output to an HMM decoder in place of the GMMs and, possibly even more important, the training process they use to train the DNN fairly efficiently. That's the biggest thing: GMMs, at least the last time I looked, can be trained and adapted much more quickly than a DNN.
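For anyone wondering what "hooking the output to an HMM decoder" amounts to, here is a rough sketch of plain Viterbi decoding over per-frame log scores. The decoder does not care whether those scores came from a GMM or a DNN. The 3-state left-to-right topology and all numbers are illustrative assumptions; a real decoder also handles lexicons, language models, and pruning.

```python
import math

NEG_INF = float("-inf")

def viterbi(log_obs, log_trans, log_init):
    """Return (best_state_path, best_log_score).

    log_obs[t][s]   : log observation score of state s at frame t
    log_trans[r][s] : log probability of moving from state r to state s
    log_init[s]     : log probability of starting in state s
    """
    n_states = len(log_init)
    # delta[s] = best log score of any path ending in state s at the current frame
    delta = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptrs = []
    for frame in log_obs[1:]:
        new_delta, ptrs = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: delta[r] + log_trans[r][s])
            new_delta.append(delta[best_prev] + log_trans[best_prev][s] + frame[s])
            ptrs.append(best_prev)
        delta = new_delta
        backptrs.append(ptrs)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, delta[last]

# Toy 3-state left-to-right model (made-up numbers).
half = math.log(0.5)
log_trans = [
    [half, half, NEG_INF],     # state 0: stay or advance
    [NEG_INF, half, half],     # state 1: stay or advance
    [NEG_INF, NEG_INF, 0.0],   # state 2: stay
]
log_init = [0.0, NEG_INF, NEG_INF]  # always start in state 0
hi, lo = math.log(0.8), math.log(0.1)
log_obs = [
    [hi, lo, lo],
    [lo, hi, lo],
    [lo, hi, lo],
    [lo, lo, hi],
]
path, score = viterbi(log_obs, log_trans, log_init)
```

With these toy scores the best path spends one frame in state 0, two in state 1, and one in state 2.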

(I'm not an expert)

I think HTK doesn't use neural networks at all. What it does is simply compute the MFCCs (mel-frequency cepstral coefficients) of the sound signal and use them as input to a chain of HMMs. Well, "simply" that, plus the dozens of refinements and tweaks needed to make it work well.
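As a rough illustration of the classic non-neural setup, each HMM state typically scores an incoming feature vector (e.g. one frame of MFCCs) with a mixture of diagonal-covariance Gaussians. The sketch below is my own minimal version with made-up parameters, not HTK code, and it skips the feature extraction entirely.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k * N(x; mean_k, diag(var_k)), computed with log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Toy 2-dimensional "feature frame" scored by a 2-component mixture (made-up numbers).
frame = [0.3, -1.2]
state_score = gmm_log_likelihood(
    frame,
    weights=[0.6, 0.4],
    means=[[0.0, -1.0], [2.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 2.0]],
)
```

One such score per state per frame is what gets handed to the HMM decoder.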

Here, I guess they do some sort of preprocessing on the sound features using their deep neural networks before feeding the whole thing to the HMMs.