| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by supercarrot 3656 days ago

Interesting observation of computational complexity.

HMMs require Viterbi algorithm to find the desirable sequence. The complexity of the algorithm depends on the number of previous decisions on which we condition the current one. Time complexity is O(n^(k+1) * S) where n is the length of the sequence, k is the number of previous decisions we condition on, and S is the number of possible decisions.

Now, why does HMM require the Viterbi search procedure, and LSTM/RNN doesn't?

Inference in LSTM is linear in the length of the sequence -- ignoring the classifier decision time.

Why does this work without Viterbi?

Given the fact that even Conditional Random Fields need dynamic programming, and that Maximum Entropy Markov Model suffers the inability to learn quickly without dynamic programming and make inference not suffer label bias.

What is so special about LSTM that it doesn't need DP.

RNN obviously learns a much better representation of features. Still, why then won't HMMs work with that same representation without the search part?

Really baffles me.

Although, there exists a process that doesn't use NN for sequence labeling and it is linear in the length of sequence and works well. Instead of using a simple classifier as is the case in MEMMs, one can use a cost-sensitive classifier and learn for each decision associated future loss and greedily avoid the loss. This allows the usage of a simple multiclass classifier, removes the search part and can work as well as NN given the right features.

1 comments

kastnerkyle 3656 days ago

RNN for discrete structured prediction tasks often use a technique called beam search during decoding (see for example papers on neural MT), which is about as close as you can get to a true Viterbi when you don't have independence assumptions. The CTC mentioned above is also a special type of "reductionist" dynamic programming which uses some awesome and clever stuff to remain backproppable. It only works for monotonic alignments (input larger than desired output).

link

supercarrot 3656 days ago

Still, beam search can be simulated and improved by running decision processes in parallel.

For example, instead of learning the sequence labeling as a sequence of n decisions (where n is the length of the sequence) you can learn sequence labeling as a sequence of 3n+1 decisions where you make 3 decisions for each sequence element and after 3n decision pick one out of three decision streams that minimizes loss using an extra decision. (when inference is done then the classifier will, hopefully, pick the stream that minimizes test loss).

This simulates a beam search and can be done during learning and inference and is probably more effective than picking confidence scores of particular decisions and keeping a beam of most confident partial sequences.

Bean search is a heuristic thing that improves performance and is done mostly to allow you to correct mistakes you made at the beginning of the process.

http://arxiv.org/abs/1603.06042

they illustrate the problem well (label bias).

the question remains the same, for example, in the paper above they approximate the partition function of CRFs with a beam but get superior results to other structured prediction methods.

link

kastnerkyle 3656 days ago

Sure, there are a lot of potential things to do - either by enumerating towards the full "brute force" loss partway as I think you are hinting toward, or treating as a subsequences, or something else [0, 1, 2]. The point is, IMO RNNs do need some kind of structured loss (more than per step likelihood) to be competitive with HMM approaches using Viterbi decoding, in addition to some sequence smarts (such as beam search) in decoding.

At least in NMT, enumerating the possible decisions as 3n + 1 is almost impossible since the softmax size is generally the memory bottleneck in training - and a bigger vocabulary is typically a huge win. It is more feasible in speech, but often your labels are themselves triphones and you end up with a pretty large vocabulary too.

Figuring out how to get RNNs closer to on par with DNN-HMMs with Viterbi decoding (or full on sequence training [3]) through something like "deep fusion" with a language model (or something else) is something I am very interested in.

[0] http://arxiv.org/abs/1312.6082

[1] http://arxiv.org/abs/1511.06456

[2] https://arxiv.org/abs/1511.04868

[3] http://www.danielpovey.com/files/2013_interspeech_dnn.pdf

link

supercarrot 3656 days ago

> IMO RNNs do need some kind of structured loss (more than per step likelihood) to be competitive with HMM approaches using Viterbi decoding

This is exactly what they do with the dependency parser I've cited, so your opinion is definitely valid. Although their approach is not general, given the fact that they approximate hamming loss with log-loss and again make it work only on sequences.

http://arxiv.org/abs/1502.02206

paper above also has a very good analysis on how to remove the search component of the inference and allow linear time complexity with competitive results.

consistent (as it is used in the machine learning theory of reductions) reduction from structured learning to multiclass classification seems to be possible. I just haven't seen anyone couple the learning procedure with neural networks. (Daume did mention they trained RNNs with the reductionist approach but seems that the code didn't make it to vowpal wabbit).

the approach above works with any loss you want (from F-score to any weird thing you might think of), the loss doesn't have to decompose over the structure (one can just announce the loss after the labelling is done and learn from that loss), it can work on any kind of structure, from images to sequences to documents for translation. it can also use a O(log n) consistent reduction of multiclass classification if speed is of the issue and if number of classes is large. It can easily work as an online method too, not requiring the full structured input.

for example, simple sequence tagging works (depending on the number of possible labels) around 500k tokens per second :D word count is only 2-4 times faster than that :D

there still aren't any papers using the above consistent reduction in the framework of NNs but I guess they'll soon be coming.

link