by dogline 895 days ago
Six paragraphs in, and I already have questions.

> Hello -> [1,2,3,4] World -> [2,3,4,5]

The vectors are random, but they look like they have a pattern here. Does the 2 in both vectors mean something? Or is it the entire set that makes each one unique?

The number reuse is just the author being a bit lazy. You can estimate how similar two vectors are by checking whether they point in similar directions, i.e. by calculating the angle between them. Here they are only about 6° apart (cosine similarity ≈ 0.99), so they point in nearly the same direction, but a lot of this is because the author didn't want to put any negative numbers in the example, so the vectors end up more similar than they would be in practice.

That the numbers are reused isn’t meaningful here: a 1 in the first position is quite unrelated to a 1 in the second (as no convolutions are done over this vector)
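
To make the angle comparison concrete, here is a minimal sketch that computes the cosine similarity and the angle for the two example vectors from the article. The helper names (`cosine_similarity`, `angle_degrees`) are my own and not from the article.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 means same direction, 0.0 means orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    # Clamp to [-1, 1] to guard against floating-point rounding before acos
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))

hello = [1, 2, 3, 4]
world = [2, 3, 4, 5]

print(round(cosine_similarity(hello, world), 4))  # 0.9938
print(round(angle_degrees(hello, world), 1))      # 6.4
```

Note that the calculation never compares individual entries, only the overall direction, which is why the shared 2, 3 and 4 carry no meaning by themselves.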

Thank you. I guess I need to back up. This is a vector, not just an identifier, and direction and angle seem important. I need to look up how the encoding is normally done, since this isn't obvious if you haven't worked in this domain before.

The encoding is typically learned, and if possible is part of the ANN so that it can be adjusted along with the other parameters.

A good place to start on that topic is the word2vec paper.
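
As a rough illustration of what "learned" means here, below is a toy sketch, not the actual word2vec algorithm. It keeps a lookup table of vectors and nudges the vectors of two co-occurring tokens so that their dot product grows, using the gradient of -log(sigmoid(u·v)). Everything in it is an assumption made for illustration: the single shared table, the function names, the dimension, and the learning rate. Real word2vec also uses negative samples and separate input and output vectors.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_embeddings(vocab, dim, seed=0):
    # Hypothetical lookup table: one small random vector per token
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 0.1) for _ in range(dim)] for tok in vocab}

def positive_pair_step(emb, a, b, lr=0.1):
    # One gradient-descent step on loss = -log(sigmoid(u . v)).
    # d(loss)/du = -(1 - sigmoid(u . v)) * v, and symmetrically for v.
    u, v = emb[a], emb[b]
    g = 1.0 - sigmoid(dot(u, v))
    emb[a] = [ui + lr * g * vi for ui, vi in zip(u, v)]
    emb[b] = [vi + lr * g * ui for ui, vi in zip(u, v)]

emb = init_embeddings(["hello", "world"], dim=8)
before = dot(emb["hello"], emb["world"])
for _ in range(100):
    positive_pair_step(emb, "hello", "world")
after = dot(emb["hello"], emb["world"])

print(before, after)  # the dot product increases as the pair is trained together
```

The point is only that the table entries are ordinary parameters: they start random and get adjusted by the same gradient updates as the rest of the network.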

That isn't a very good example. The vectors for each token are randomly initialized, with each element drawn from a normal distribution. After training, similar words will have a fairly high cosine similarity, but almost never as high as that of [1,2,3,4] and [2,3,4,5].
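
A quick way to see what random initialization looks like is a sketch like the following. It assumes standard-normal entries and uses dimensions (4 and 300) that I picked for illustration, not values from the article. It averages the absolute cosine similarity over many random pairs.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def random_vector(rng, dim):
    # Each element drawn independently from a standard normal distribution
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def mean_abs_cosine(dim, pairs=200, seed=0):
    # Average |cosine similarity| over `pairs` independent random pairs
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        total += abs(cosine_similarity(random_vector(rng, dim), random_vector(rng, dim)))
    return total / pairs

print(mean_abs_cosine(4))    # small vectors: random pairs can still look fairly similar
print(mean_abs_cosine(300))  # larger vectors: random pairs are close to orthogonal
```

For independent random vectors the typical cosine similarity shrinks roughly like 1/sqrt(dim), so at realistic embedding sizes a freshly initialized pair is nowhere near the 0.99 of the article's example.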