by languagehacker 4337 days ago
This is pretty badass. I'm assuming unseen words are really what's left to work on. If you ensemble this model with one that uses the same ideas but generalizes beyond specific terms, you might be able to get there. For instance, generate a matrix that represents n-gram word sequences by each word's part of speech and semantic category. When making predictions on unseen words only, you can then use those values to help guide your prediction. You could use cues from phonology and morphology to predict the unseen word's semantic category. You could build on that value with cues from morphology and word order to predict the word's part of speech. Once you have that, plus the information for the adjacent, known words, you might be able to make a more reliable prediction even on hapax legomena.
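To make the idea concrete, here is a rough sketch of the back-off step only, not anything from the paper. It guesses a coarse class for an unseen word from its suffix (a stand-in for the morphology cues) and uses the centroid of the known words in that class as a substitute vector. The toy vectors, the suffix table, and the names `guess_class` and `vector_for` are all made up for illustration; a real system would learn the class predictor and would also use the neighboring words.

```python
# Toy 2-d vectors and suffix rules, invented purely for illustration.
KNOWN_VECTORS = {
    "running": [0.9, 0.1],
    "jumping": [0.7, 0.3],
    "quickly": [0.1, 0.9],
    "slowly": [0.3, 0.7],
}

# Hypothetical morphology cue: suffix -> coarse word class.
SUFFIX_TO_CLASS = {"ing": "VERB_GERUND", "ly": "ADVERB"}


def guess_class(word):
    """Guess a coarse class from the longest matching suffix, or None."""
    for suffix in sorted(SUFFIX_TO_CLASS, key=len, reverse=True):
        if word.endswith(suffix):
            return SUFFIX_TO_CLASS[suffix]
    return None


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def vector_for(word):
    """Return the trained vector if the word is known, else a class centroid."""
    if word in KNOWN_VECTORS:
        return KNOWN_VECTORS[word]
    word_class = guess_class(word)
    if word_class is None:
        return None  # no cue available, so no back-off vector
    peers = [
        vec for known, vec in KNOWN_VECTORS.items()
        if guess_class(known) == word_class
    ]
    return centroid(peers) if peers else None


if __name__ == "__main__":
    # "blorping" is unseen; it backs off to the centroid of the "-ing" words.
    print(vector_for("blorping"))
```

The centroid is the crudest possible choice here; the point is only that the unseen word gets a representation derived from its predicted category rather than from its own (missing) co-occurrence counts.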

Yes, the method you propose for inducing a representation for unseen words is sound.

However, once you can train on almost one trillion tokens, unknown words are not going to come up very often. That is, what's really left to work on is inducing higher-quality representations of observed words. The goal would be that a simple model could take these representations as input and perform well on, say, the word analogies task (or any other pure lexical semantics task).
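For anyone who hasn't seen the word analogies task, here is a minimal sketch of the usual vector-offset formulation: answer "a is to b as c is to ?" with the word whose vector is closest (by cosine similarity) to b - a + c. The toy 2-d vectors and the function names are invented for illustration; real evaluations use trained embeddings with hundreds of dimensions and a much larger vocabulary.

```python
import math

# Toy 2-d vectors, invented purely for illustration.
VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.5, 0.9],
    "woman": [0.5, 0.3],
    "apple": [0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors=VECTORS):
    """Answer 'a is to b as c is to ?' via the nearest neighbor of b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the three query words, as is conventional for this task.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


if __name__ == "__main__":
    print(analogy("man", "king", "woman"))
```

The "simple model" in this case is just vector arithmetic plus a nearest-neighbor lookup, so any gain on the task has to come from the quality of the representations themselves.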

What's interesting about Pennington et al.'s work for me is how they found a really fast training method, and thus could train on 840B tokens from Common Crawl. I've spent a lot of time thinking about this problem, and this approach is quite elegant.