
The problem is only as hard as a 50-way classification. If the only goal were to minimize WER, a simple post-processing step that picks the nearest sentence in the training set could easily bring the WER down further. They presumably did it their way to show that it can be done that way, and I don't fault them for it.
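To make the post-processing idea concrete, here is a minimal sketch, assuming "nearest" means smallest word-level edit distance. The function names and the example sentences are hypothetical and are not taken from the paper.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete wa
                cur[j - 1] + 1,            # insert wb
                prev[j - 1] + (wa != wb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def snap_to_training_sentence(decoded, training_sentences):
    """Return the training sentence closest to the decoded output."""
    words = decoded.split()
    return min(
        training_sentences,
        key=lambda s: word_edit_distance(words, s.split()),
    )


# Hypothetical closed set of sentences (illustration only).
training_sentences = [
    "the bird flew over the house",
    "a dog ran across the yard",
    "she opened the old wooden door",
]

# A decoder output with two word errors snaps back to its source sentence.
print(snap_to_training_sentence("the bird flew under a house", training_sentences))
```

With a closed set of sentences, any output that is closer to the correct sentence than to the others ends up with zero word errors after this step.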

They claim that word-by-word decoding implies that the network has learned to identify words. This may well be true, but that conclusion doesn't follow from their result. For example, suppose you average all electrode samples over the relevant timespan, transform that representation with a feed-forward neural net, and feed the result into an RNN decoder. The model would still predict word by word, from a representation that necessarily cannot distinguish between words (because the time dimension has been averaged out). Such a model can still output words in the right order, purely because the statistics of the training sentences are baked into the decoder RNN.
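A toy version of that thought experiment, as a sketch only: a nearest-centroid lookup stands in for the feed-forward net plus RNN decoder, and the data is synthetic. All names here (`time_average`, `AveragedDecoder`, `make_trial`) are hypothetical. The point it illustrates is that shuffling the time steps of a trial leaves the averaged features essentially unchanged, yet the decoder still emits the words one at a time in the right order.

```python
import random


def time_average(trial):
    """Collapse a (time x electrodes) trial to one value per electrode."""
    n = len(trial)
    return [sum(col) / n for col in zip(*trial)]


def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


class AveragedDecoder:
    """Stand-in decoder: sees only time-averaged features, emits word by word."""

    def __init__(self):
        self.centroids = {}

    def fit(self, trials, sentences):
        grouped = {}
        for trial, sentence in zip(trials, sentences):
            grouped.setdefault(sentence, []).append(time_average(trial))
        for sentence, vecs in grouped.items():
            self.centroids[sentence] = [sum(c) / len(vecs) for c in zip(*vecs)]

    def decode(self, trial):
        feat = time_average(trial)
        best = min(self.centroids, key=lambda s: sq_dist(feat, self.centroids[s]))
        # Word order comes entirely from the memorized sentence,
        # not from any timing information in the input.
        for word in best.split():
            yield word


def make_trial(mean_vec, rng, steps=20):
    """Synthetic trial: per-electrode mean plus small Gaussian noise."""
    return [[m + rng.gauss(0, 0.1) for m in mean_vec] for _ in range(steps)]


rng = random.Random(0)
means = {"the dog barked": [1.0, 0.0, 0.0], "birds sing loudly": [0.0, 1.0, 0.0]}
trials, labels = [], []
for sentence, mean_vec in means.items():
    for _ in range(5):
        trials.append(make_trial(mean_vec, rng))
        labels.append(sentence)

decoder = AveragedDecoder()
decoder.fit(trials, labels)

test_trial = make_trial(means["the dog barked"], rng)
shuffled = test_trial[:]
rng.shuffle(shuffled)  # destroy all within-trial timing

print(list(decoder.decode(test_trial)))
print(list(decoder.decode(shuffled)))
```

Both calls print the same word sequence, even though the second input carries no usable information about when anything was said.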