word2vec in yhat: Word vector similarity

	word2vec in yhat: Word vector similarity (danielfrg.github.io)
	59 points by dfrodriguez143 4641 days ago

6 comments

Radim 4641 days ago

For people interested in a cleaned-up, commented and de-obfuscated word2vec, I recently ported the original C code to Python [1].

My HN submission of this endeavour received no love, but I think it's worthwhile nevertheless as the Python code is not only more concise, readable and extendable, but the training's actually faster too [2].

[1] https://github.com/piskvorky/gensim/blob/develop/gensim/mode...

[2] http://radimrehurek.com/2013/09/word2vec-in-python-part-two-...

dfrodriguez143 4641 days ago

Your submission receives no love but my one afternoon hack does... oh the humanity... lol

That is some amazing work, thanks!

bowyakka 4641 days ago

It's sad you didn't get the love on your submission; your changes are very neat and having word2vec inside gensim feels like a really awesome feature.

rspeer 4641 days ago

Well done!

Mikolov said that he hoped word2vec would "significantly advance the state of the art" of NLP, but really the state of the art can only advance when people can understand and manipulate the code. You're making that possible. Thank you.

judk 4641 days ago

Word2vec seemed intuitively obvious to me, but I really have a hard time believing that it works in only 1000 dimensions, generating results beyond cherry-picked demo examples.

Are there really only 1000 independent concepts in the English language?

ogrisel 4641 days ago

No, but with n binary dimensions (with value 0 or 1) you can encode 2^n unique identifiers.

So with 1000 continuous dimensions (typically values between -1 and 1 coded on 32 bit floats) you can encode quite a bunch of concepts and their nuances.

Note: the default dimensionality of word2vec is 100, not 1000. Apparently you can get better results with dim=300 and a very large training corpus. To benefit from higher dimensions you need more CPU time to reach convergence and a lot more data to exploit the added model capacity.

gojomo 4641 days ago

I'm still impressed it only takes 26 letters, in words of average size around 5! By comparison, 1000 continuous dimensions seems positively resplendent with expressiveness.

FWIW, 2^61 > 26^5, so even the binary vector 2^1000 has an expressive space about 2^939 times larger than 26^5 (all possible words up to 5 letters).
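
For anyone who wants to check the arithmetic above, here is a small sketch. It only counts letter strings over a 26-letter alphabet (an assumption: every combination counts as a "word") and compares that count with a 1000-bit binary space. The bound 2^61 > 26^5 holds but is loose: 26^5 already fits in 24 bits, so the ratio comes out closer to 2^976 than 2^939.

```python
import math

ALPHABET = 26   # lowercase letters only (assumption for this sketch)
MAX_LEN = 5     # "words up to 5 letters"

# Number of letter strings of exactly 5 letters, and of 1 to 5 letters.
exactly_five = ALPHABET ** MAX_LEN
up_to_five = sum(ALPHABET ** k for k in range(1, MAX_LEN + 1))

# Bits needed to give every string of up to 5 letters its own identifier.
bits_needed = math.ceil(math.log2(up_to_five))

# How many times larger a 1000-bit space is, expressed as a power of two.
ratio_log2 = 1000 - math.log2(up_to_five)

print(exactly_five, up_to_five, bits_needed, round(ratio_log2, 2))
```

Either way the point stands: even a purely binary 1000-dimensional vector has vastly more room than short letter strings do.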

judk 4641 days ago

Yes, but there are exponentially more concepts than words. The words we have are a sparse set of labels for particularly relevant combinations.

But yeah, the continuous dimensions can hide many more binary dimensions.

For example, 4-D rgba can be smashed into 1 continuous (or 64-bit) dimension, but that feels a bit like cheating.

So it sort of feels like "1000 64-bit dimensions" is a tricky name for 64,000 1-bit dimensions.
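
To make the rgba point concrete, here is a minimal sketch of packing four channels into a single number and getting them back. It assumes 8 bits per channel and a 32-bit packed value, which is a common layout but not something stated in the comment; the function names are made up for illustration.

```python
def pack_rgba(r, g, b, a):
    """Pack four 8-bit channels (0-255 each) into one 32-bit integer."""
    for channel in (r, g, b, a):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be in 0..255")
    return (r << 24) | (g << 16) | (b << 8) | a


def unpack_rgba(packed):
    """Recover the four 8-bit channels from the packed integer."""
    return (
        (packed >> 24) & 0xFF,
        (packed >> 16) & 0xFF,
        (packed >> 8) & 0xFF,
        packed & 0xFF,
    )


color = (255, 128, 0, 64)
packed = pack_rgba(*color)
restored = unpack_rgba(packed)
print(hex(packed), restored)
```

The packed value is one "dimension" in name only: it still carries 4 x 8 = 32 bits, which is the same accounting that turns 1000 64-bit dimensions into 64,000 bits.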

IanCal 4641 days ago

I wouldn't be surprised if you cover most basic English with 1000 concepts. That would give a lot of combinations.

3JPLW 4641 days ago

Very cool. I missed the original word2vec software discussion back in August: https://news.ycombinator.com/item?id=6216044

And the paper itself is a very worthwhile read: http://arxiv.org/abs/1301.3781

dhammack 4641 days ago

The vectors learned from word2vec are pretty amazing. A few days after the tool was released I wrote a script which uses the vector representations to figure out which word in a list isn't like the others [1]. Things like:

->math shopping reading science

I think shopping doesnt belong in this list!

->rain snow sleet sun

I think sun doesnt belong in this list!

etc.

[1] https://github.com/dhammack/Word2VecExample
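
For readers curious how an "odd one out" picker can work, here is a rough sketch of one common approach: normalize each word vector, average them, and report the word whose vector is least similar (by cosine) to that average. This is not necessarily what the linked script does, and the 3-D vectors below are made up for illustration; real word2vec vectors are learned from a corpus and have 100 or more dimensions.

```python
import math

# Made-up toy vectors, for illustration only (not real word2vec output).
TOY_VECTORS = {
    "rain": [0.9, 0.8, 0.1],
    "snow": [0.8, 0.9, 0.2],
    "sleet": [0.85, 0.85, 0.15],
    "sun": [0.1, 0.2, 0.9],
}


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of all the words' vectors."""
    unit = {word: normalize(vectors[word]) for word in words}
    dims = len(unit[words[0]])
    mean = normalize(
        [sum(unit[word][i] for word in words) / len(words) for i in range(dims)]
    )

    def similarity_to_mean(word):
        return sum(a * b for a, b in zip(unit[word], mean))

    return min(words, key=similarity_to_mean)


outlier = odd_one_out(["rain", "snow", "sleet", "sun"], TOY_VECTORS)
print("I think", outlier, "doesnt belong in this list!")
```

With real vectors the same function would take a mapping loaded from a trained model in place of `TOY_VECTORS`.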

gojomo 4641 days ago

Eventually computers will be talking about us behind our backs in these high-dimensional vectors, only occasionally translating down to English approximations, to humor us. "Goo goo, gah gah, human?"

seiji 4641 days ago

Have you read the [Message Contains No Recognizable Symbols] series? It's pretty great: http://www.ssec.wisc.edu/~billh/g/mcnrs.html

gojomo 4641 days ago

Haven't but will check it out, thanks!

gojomo 4640 days ago

Cool web demo powered by word2vec, by Christopher Moody:

http://thisplusthat.me/
