by supercarrot
3656 days ago
Still, beam search can be simulated, and improved on, by running decision processes in parallel. For example, instead of learning sequence labeling as a sequence of n decisions (where n is the length of the sequence), you can learn it as a sequence of 3n+1 decisions: you make 3 decisions for each sequence element, and after those 3n decisions you use one extra decision to pick whichever of the three decision streams minimizes loss. (At inference time the classifier will, hopefully, pick the stream that minimizes test loss.) This simulates a beam search, can be done during both learning and inference, and is probably more effective than taking confidence scores of particular decisions and keeping a beam of the most confident partial sequences. Beam search is a heuristic that improves performance and is done mostly to let you correct mistakes you made at the beginning of the process. http://arxiv.org/abs/1603.06042 illustrates the problem well (label bias). The question remains the same: for example, in the paper above they approximate the partition function of CRFs with a beam but get superior results to other structured prediction methods.
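To make the 3n+1 idea concrete, here is a minimal toy sketch. It is not taken from the linked paper. The three policies, the `select` callback, and the example sentence are all hypothetical stand-ins. In a real system the per-element decisions and the final stream choice would be learned classifiers, and the oracle selector below is only available at training time, when gold labels are known.

```python
from typing import Callable, List, Sequence, Tuple

Label = int
# A policy makes one labeling decision given the tokens, the position,
# and the labels this stream has already produced (hypothetical interface).
Policy = Callable[[Sequence[str], int, List[Label]], Label]


def run_stream(policy: Policy, tokens: Sequence[str]) -> List[Label]:
    """Make n decisions, one per token, conditioning on this stream's history."""
    labels: List[Label] = []
    for i in range(len(tokens)):
        labels.append(policy(tokens, i, labels))
    return labels


def hamming_loss(pred: Sequence[Label], gold: Sequence[Label]) -> int:
    """Number of positions where the prediction disagrees with gold."""
    return sum(p != g for p, g in zip(pred, gold))


def label_with_streams(
    policies: Sequence[Policy],
    tokens: Sequence[str],
    select: Callable[[List[List[Label]]], int],
) -> Tuple[List[Label], int]:
    """3n decisions (three streams of n) plus 1 decision (which stream to keep)."""
    streams = [run_stream(p, tokens) for p in policies]
    choice = select(streams)
    return streams[choice], choice


# Three toy policies standing in for learned classifiers (assumptions).
def always_zero(tokens, i, history):
    return 0


def capital_is_one(tokens, i, history):
    return 1 if tokens[i][0].isupper() else 0


def copy_previous(tokens, i, history):
    return history[-1] if history else 1


tokens = ["Alice", "met", "Bob"]
gold = [1, 0, 1]


def oracle_select(streams):
    """Training-time selector: pick the stream with the lowest loss."""
    return min(range(len(streams)), key=lambda k: hamming_loss(streams[k], gold))


result, choice = label_with_streams(
    [always_zero, capital_is_one, copy_previous], tokens, oracle_select
)
print(result, choice)
```

Unlike a standard beam, each stream here keeps its own full history and nothing is pruned along the way. The only cross-stream comparison is the single extra decision at the end.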
At least in NMT, enumerating the possible decisions as 3n + 1 is almost impossible, since the softmax size is generally the memory bottleneck in training and a bigger vocabulary is typically a huge win. It is more feasible in speech, but often your labels are themselves triphones, so you end up with a pretty large vocabulary there too.
Figuring out how to get RNNs closer to parity with DNN-HMMs with Viterbi decoding (or full-on sequence training [3]) through something like "deep fusion" with a language model (or something else) is something I am very interested in.
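A rough back-of-envelope sketch of why the softmax dominates memory. The batch size, sequence length, vocabulary size, and float32 storage are illustrative assumptions rather than numbers from any of the cited papers, and only the output logits are counted (no gradients or other activations):

```python
def softmax_activation_bytes(batch, seq_len, vocab, streams=1, bytes_per_float=4):
    """Bytes needed to hold the output logits for every position of every stream."""
    return batch * seq_len * vocab * streams * bytes_per_float


# Hypothetical NMT-like setting: 64 sentences, 50 tokens each, 50k-word vocabulary.
one_stream = softmax_activation_bytes(64, 50, 50_000)
three_streams = softmax_activation_bytes(64, 50, 50_000, streams=3)

print(f"{one_stream / 1e9:.2f} GB vs {three_streams / 1e9:.2f} GB")
```

Under these assumptions, three parallel decision streams triple an already large term, and that cost grows linearly with vocabulary size.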
[0] http://arxiv.org/abs/1312.6082
[1] http://arxiv.org/abs/1511.06456
[2] https://arxiv.org/abs/1511.04868
[3] http://www.danielpovey.com/files/2013_interspeech_dnn.pdf