by Sai_ 1165 days ago
I’m not a data scientist but I think I know why one document could lead to many vectors.

(Happy to be corrected and/or schooled.)

A vector is a list of numbers, each of which represents the weight accorded to a certain word along a certain dimension.

Let’s take an example.

Is an “apple” a “positive” or a “negative” thing? Most people would associate positivity with apples. So, for the general population, the vector for “apple” along a 0-1 continuum, where 0 represents negative sentiment and 1 represents positive sentiment, would be something like [0.8].

Let’s add one more dimension. Is an apple associated with computers (1) or not (0)? For the majority of the world, where Windows has a massive market share, “apple” would recall a fruit, not a sleek laptop. Therefore, the value for apple along the computer/non-computer dimension is probably [0.3].

Taken together, apple = [0.8, 0.3], where, positionally, 0.8 is the value for positive/negative sentiment and 0.3 is the value for computer/non-computer.

Agree?

(Hoping you do)

But that [0.8, 0.3] vector is for the general population.

Would a bible literalist who publishes blogs on bible stories feel the same way?

For someone like that, the notion of original sin could taint their sentiment towards the apple, so they might weight an apple at 0.2 on the positive/negative line. Since they’re bloggers, they’re more likely to associate apple with computers, so they might call it 0.5. Therefore, their apple vector is [0.2, 0.5].

Extend this to more content and you’ll see why there is more than one vector.
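
To make the two apple vectors above concrete, here is a minimal sketch in Python. The names DIMENSIONS, apple_by_audience and describe are made up for illustration; the numbers are the ones from this comment, and nothing here reflects how a real embedding model assigns values.

```python
# Each position in the vector has a fixed meaning:
# index 0 = positive/negative sentiment, index 1 = computer/non-computer.
DIMENSIONS = ("sentiment", "computer")

# The same word gets a different vector depending on whose view it encodes.
# Values are the illustrative ones from the comment, not model output.
apple_by_audience = {
    "general population": [0.8, 0.3],
    "bible literalist blogger": [0.2, 0.5],
}


def describe(vector):
    """Pair each number with the dimension it belongs to."""
    return dict(zip(DIMENSIONS, vector))


for audience, vector in apple_by_audience.items():
    print(audience, describe(vector))
```

One word, two audiences, two vectors. Add more audiences or more content and the number of vectors grows the same way.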

At least that’s how I understood it. Happy to be corrected and/or schooled.

2 comments

In my opinion, you could represent "apple" as a vector, for example, [0.99, 0.3, 0.7] in relation to [fruits, computers, religion]. Then, you can create different user vectors that describe the interests of various groups. For instance, the general population might have a vector like [0.8, 0.2, 0.1], geeks as [0.6, 0.95, 0.05], and religious people as [0.7, 0.1, 0.95].

By creating these user vectors, you can compare them with the "apple" vector and find the best match using ANN. This approach allows you to determine which group is most interested in a given context or aspect of the word "apple." The ANN will help you identify similarities or patterns in the user vectors and the "apple" vector to find the most relevant matches.
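
A rough sketch of that comparison, under a few assumptions: ANN is taken to mean approximate nearest neighbour search, the similarity measure is cosine similarity (the comment does not name one), and with only three user vectors the code simply checks every vector exactly instead of approximating. The function and variable names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Dimensions, in order: [fruits, computers, religion].
apple = [0.99, 0.3, 0.7]

user_vectors = {
    "general population": [0.8, 0.2, 0.1],
    "geeks": [0.6, 0.95, 0.05],
    "religious people": [0.7, 0.1, 0.95],
}

# Exact (brute-force) nearest neighbour: score every group, keep the best.
# An ANN index would trade some accuracy for speed on large collections.
scores = {
    name: cosine_similarity(apple, vector)
    for name, vector in user_vectors.items()
}
best_match = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
print("best match:", best_match)
```

With these particular numbers the closest group is religious people, because the apple vector has a high religion weight (0.7) and their fruits weight is also high.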

Thank you

I don’t know what ANN is, but your comment raises two questions in my mind:

1. Where did your first vector of [0.99, 0.3, 0.7] come from? You later present the concept of user vectors which are vectors for different cohorts of users but don’t name the first vector as a user vector.

2. I feel my example of vectors for “general population” users and “bible literalist blogger” users aligns with your “user vector” concept.

Modern text embeddings are not word-based like that.
If my understanding and explanation are directionally correct, I’m happy. I’ll be the first one to admit I’m not a data scientist.

Do you have a good example of how an actual data scientist would present the idea of vectors as applied to sentences/documents to a layperson?