by lars 2273 days ago
This is cool. For those who are not super familiar with language processing, I think it's good to point out the limitations of what's been done here though. They mention that professional speech transcription has a word error rate (WER) of around 5%, and that their method gets a WER of 3%. Sure, but the big distinction is that speech transcription must operate on an effectively unbounded set of sentences, including sentences that have never been said before. This method only has to distinguish between 30-50 sentences, and each sentence must appear at least twice in the training set and once in the test set. Decoding word-by-word is really a roundabout way of doing a 50-way classification here.

It's an invasive technique, so they need electrodes on a human cortex. This means data collection is costly, so they're operating in a very low-data regime compared to most other seq2seq applications. It seems theoretically possible that this could operate at Google Translate-level accuracy if the sentence dataset were terabyte-sized rather than kilobyte-sized. A dataset of that size seems very unlikely to be collected any time soon, so we'll need massive leaps in data efficiency in machine learning for something like this to reach that level. They explore transfer learning for this, which is nice to see. Subject-independent modelling is almost certainly a requirement to achieve significant leaps in accuracy for methods like this.

1 comments

Is the following quote at odds with what you are saying about 50-way classification?

"On the other hand, the network is not merely classifying sentences, since performance is improved by augmenting the training set even with sentences not contained in the testing set (Fig. 3a,b). This result is critical: it implies that the network has learned to identify words, not just sentences, from ECoG data, and therefore that generalization to decoding of novel sentences is possible."

The difficulty of the problem is that of a 50-way classification. If the only goal was to minimize WER, a simple post-processing step choosing the nearest sentence in the training set could easily bring the WER down further. They've chosen to do it the way they did it presumably to show that it can be done that way, and I don't fault them for it.
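
To make the post-processing idea concrete, here is a minimal sketch of what I mean by snapping a decoded sentence to the nearest known one. This is not from the paper: the function names (`word_edit_distance`, `wer`, `snap_to_nearest`) and the example sentences are made up for illustration, and I'm assuming "nearest" means smallest word-level edit distance.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def snap_to_nearest(hypothesis, known_sentences):
    """Replace a decoded sentence with the closest sentence from a closed set."""
    hyp = hypothesis.split()
    return min(known_sentences,
               key=lambda s: word_edit_distance(s.split(), hyp))


# Hypothetical closed set standing in for the 30-50 training sentences.
known_sentences = [
    "the birch canoe slid on the smooth planks",
    "glue the sheet to the dark blue background",
    "it is easy to tell the depth of a well",
]

# Hypothetical decoder output with two wrong words.
decoded = "the birch canoe slid on a smooth plank"
snapped = snap_to_nearest(decoded, known_sentences)

print(wer(known_sentences[0], decoded))  # WER before snapping
print(wer(known_sentences[0], snapped))  # WER after snapping
```

With a closed set this small, any decoding that lands closer to the right sentence than to the others gets pulled to zero WER, which is why I'd treat the raw number as a classification result.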

They claim that word-by-word decoding implies that the network has learned to identify words. This may well be true, but it isn't possible to claim that from their result. For example, let's say you average all electrode samples over the relevant timespan, transform that representation with a feed-forward neural net, and feed that into an RNN decoder. It would still predict word-by-word, on a representation that necessarily does not distinguish between words (because the time dimension has been averaged over). Such a model can still output words in the right order, just from the statistics of the training sentences being baked into the decoder RNN.
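
A toy illustration of the averaging point, with made-up numbers rather than real ECoG data: `time_average` and the three-channel frames below are hypothetical. Once you average over time, two recordings containing the same frames in a different order collapse to the same vector, so whatever ordering the decoder outputs has to come from the decoder itself.

```python
def time_average(frames):
    """Average a list of per-timestep electrode vectors over the time axis."""
    n = len(frames)
    return [sum(channel) / n for channel in zip(*frames)]


# Hypothetical 3-channel frames for two different "words".
word_a = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
word_b = [[0.0, 4.0, 1.0], [2.0, 2.0, 1.0]]

# The same two words spoken in opposite orders.
a_then_b = word_a + word_b
b_then_a = word_b + word_a

avg_ab = time_average(a_then_b)
avg_ba = time_average(b_then_a)

# Identical representations: the word order is not recoverable from the input.
print(avg_ab, avg_ba, avg_ab == avg_ba)
```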