For people interested in a cleaned-up, commented and de-obfuscated word2vec, I recently ported the original C code to Python [1].
My HN submission of this endeavour received no love, but I think it's worthwhile nevertheless as the Python code is not only more concise, readable and extendable, but the training's actually faster too [2].
Mikolov said that he hoped word2vec would "significantly advance the state of the art" of NLP, but really the state of the art can only advance when people can understand and manipulate the code. You're making that possible. Thank you.
Word2vec seemed intuitively obvious to me, but I really have a hard time believing that it works in only 1000 dimensions, generating results beyond cherry-picked demo examples.
Are there really only 1000 independent concepts in the English language?
No, but with n binary dimensions (each with value 0 or 1) you can encode 2^n unique identifiers.
So with 1000 continuous dimensions (typically values between -1 and 1, stored as 32-bit floats) you can encode quite a bunch of concepts and their nuances.
Note: the default dimensionality of word2vec is 100, not 1000. Apparently you can get better results with dim=300 and a very large training corpus. To take advantage of higher dimensions you need more CPU time to reach convergence and a lot more data to make use of the added model capacity.
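The counting argument above can be checked in a few lines. This is only a sketch: the helper name `binary_codes` and the vocabulary size of one million words are made up for illustration and do not come from word2vec itself.

```python
from itertools import product


def binary_codes(n):
    """Return every distinct vector with n binary (0 or 1) dimensions."""
    return list(product((0, 1), repeat=n))


codes = binary_codes(3)
print(len(codes))   # number of distinct codes, equal to 2**3
print(codes[:3])    # the first few codes, as tuples of 0s and 1s

# Assumed vocabulary size, chosen only for illustration.
vocab_size = 1_000_000

# Smallest number of binary dimensions that gives every word its own code.
bits_needed = (vocab_size - 1).bit_length()
print(bits_needed)
```

Unique identifiers are the easy part. The extra dimensions in a real model are there so that similar words can land near each other, which a bare ID scheme does not give you.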
I'm still impressed it only takes 26 letters, in words of average size around 5! By comparison, 1000 continuous dimensions seems positively resplendent with expressiveness.
FWIW, 26^5 is only about 2^23.5, so even a binary vector with 1000 dimensions (2^1000 possible values) has an expressive space about 2^976 times larger than 26^5 (all possible 5-letter words).
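The comparison is easy to check with Python's exact integers. In this sketch the variable names are arbitrary; the ratio between 2^1000 and 26^5 works out to roughly 2^976.

```python
import math

five_letter = 26 ** 5                             # strings of exactly five lowercase letters
up_to_five = sum(26 ** k for k in range(1, 6))    # strings of one to five letters

bits_for_words = math.log2(five_letter)           # bits needed to index every five-letter string
ratio_exponent = 1000 - bits_for_words            # log2 of (2**1000 / 26**5)

print(five_letter, up_to_five)
print(round(bits_for_words, 2), round(ratio_exponent, 2))
print(2 ** 61 > five_letter)                      # a much looser bound that also holds
```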
The vectors learned from word2vec are pretty amazing. A few days after the tool was released I wrote a script which uses the vector representations to figure out which word in a list isn't like the others [1]. Things like:
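A rough sketch of that kind of odd-one-out check, using made-up three-dimensional vectors rather than real word2vec output. It is not the script mentioned above: the function names and the toy numbers are assumptions, and the approach (pick the word least similar to the mean vector) is just one plausible way to do it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all the vectors."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


# Toy vectors invented for illustration; a trained model would have 100+ dimensions.
toy_vectors = {
    "breakfast": [0.90, 0.10, 0.00],
    "lunch":     [0.80, 0.20, 0.10],
    "dinner":    [0.85, 0.15, 0.05],
    "car":       [0.10, 0.90, 0.30],
}

words = ["breakfast", "lunch", "dinner", "car"]
print(odd_one_out(words, toy_vectors))
```

With real embeddings you would load the vectors from a trained model instead of hard-coding them; the selection logic stays the same.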
Eventually computers will be talking about us behind our backs in these high-dimensional vectors, only occasionally translating down to English approximations, to humor us. "Goo goo, gah gah, human?"
[1] https://github.com/piskvorky/gensim/blob/develop/gensim/mode...
[2] http://radimrehurek.com/2013/09/word2vec-in-python-part-two-...