| HN Mirror


by dogline 895 days ago

Six paragraphs in, and I already have questions.

> Hello -> [1,2,3,4] World -> [2,3,4,5]

The vectors are random, but they look like they have a pattern here. Does the 2 in both vectors mean something? Or is it the entire set that makes it unique?

2 comments

dan-robertson 895 days ago

The number reuse is just the author being a bit lazy. You could estimate how similar these vectors are by seeing if they point in similar directions or by calculating the angle between them. Here they are only about 6° apart, so they point in nearly the same direction, but a lot of this is that the author didn’t want to put any negative numbers in the example, so the vectors end up being more similar than they would be in practice.

That the numbers are reused isn’t meaningful here: a 1 in the first position is quite unrelated to a 1 in the second (as no convolutions are done over this vector).
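
For anyone who wants to check the angle between the two example vectors, here is a minimal sketch using the usual dot-product formula. The function name `angle_degrees` is just an illustrative choice, not something from the article.

```python
import math

def angle_degrees(u, v):
    """Angle between two vectors, in degrees, via the dot-product formula."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

hello = [1, 2, 3, 4]
world = [2, 3, 4, 5]
print(round(angle_degrees(hello, world), 1))  # prints 6.4
```

Because every component is positive, both vectors sit in the same corner of the space, which is why the angle comes out so small.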


dogline 895 days ago

Thank you. I guess I need to back up. This is a vector, not just an identifier, and direction and angle seem important. I need to look up how the encoding is normally done, since this isn't obvious if you haven't worked in this domain before.


kevindamm 895 days ago

The encoding is typically learned, and if possible is part of the ANN so that it can be adjusted along with the other parameters.

A good place to start on that topic is the word2vec paper.
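
As a rough illustration of what "learned" means here, the sketch below treats the embedding as a small table of parameters and nudges one row with gradient descent. Everything in it is a toy assumption: the two-token `vocab`, the `target` vector standing in for a signal from the rest of the network, and the squared-error loss. In a real model the gradient would come from backpropagation through the network's own loss.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
vocab = {"Hello": 0, "World": 1}  # hypothetical two-token vocabulary
dim = 4

# The embedding table is a matrix of parameters, one row per token,
# randomly initialized like any other weight.
table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]

def embed(token):
    """Look up the vector (table row) for a token."""
    return table[vocab[token]]

def loss(token, target):
    """Toy squared-error loss between a token's vector and a target."""
    return sum((e - t) ** 2 for e, t in zip(embed(token), target))

def sgd_step(token, target, lr=0.1):
    """One gradient step on the toy loss; only the looked-up row changes."""
    row = embed(token)
    for i in range(dim):
        grad = 2.0 * (row[i] - target[i])
        row[i] -= lr * grad

target = [0.5, -0.5, 0.25, 0.0]   # stand-in for a training signal
before = loss("Hello", target)
for _ in range(20):
    sgd_step("Hello", target)
after = loss("Hello", target)
print(before, after)              # the loss shrinks as the row is adjusted
```

The point is only that the vector for a token is a set of trainable numbers, updated the same way as the other parameters.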


smaddox 895 days ago

That isn't a very good example. The vectors for each token are randomly initialized, with each element drawn from a normal distribution. After training, similar words will have some cosine similarity, but almost never as much cosine similarity as [1,2,3,4] and [2,3,4,5].
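
To get a feel for the gap, here is a small sketch comparing two randomly initialized vectors against the article's pair. The width of 512 and the seed are arbitrary assumptions chosen for illustration.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 512               # hypothetical embedding width

# Two "token vectors" with every element drawn from a standard normal.
a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

print(cosine_similarity(a, b))                        # close to 0
print(cosine_similarity([1, 2, 3, 4], [2, 3, 4, 5]))  # roughly 0.994
```

Independent random vectors in a high-dimensional space are nearly orthogonal, so a similarity around 0.99 is far from what initialization produces.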
