by binarymax 2381 days ago
Word embeddings are learned without supervision, so you don't choose the features themselves, only how many there are (the dimensionality of the vector). The model then learns a scalar for each feature, giving a single vector per term, and what those values end up capturing depends on the algorithm/architecture.

When using CBOW, for instance, with a set window size N, the features learned for a single term are based on the N terms on either side of it. Those context terms are averaged together, so their order within the window is ignored.
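To make the window concrete, here is a rough sketch of the (context, target) pairs a CBOW-style model would train on. It assumes the usual word2vec setup, where the context is up to N terms on each side of the target and is treated as an unordered bag. The function name cbow_pairs is made up for illustration and is not from any library.

```python
def cbow_pairs(tokens, window):
    """Build (context_words, target_word) pairs the way CBOW frames them."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` terms before
        right = tokens[i + 1:i + 1 + window]   # up to `window` terms after
        # CBOW averages the context vectors, so order carries no signal;
        # sorting here only makes that explicit.
        pairs.append((sorted(left + right), target))
    return pairs


pairs = cbow_pairs("the reservation is confirmed".split(), window=2)
for context, target in pairs:
    print(context, "->", target)
```

The model is then trained to predict each target from its bag of context words, and the learned weights are what become the word vectors.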

This will result in similar vectors for terms appearing in the same contexts. It has its pros and cons though. A great example is “the reservation is confirmed” vs “the reservation is cancelled”: confirmed/cancelled will end up with similar vectors even though they mean opposite things.
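A toy way to see the effect, without training anything: the sketch below swaps in plain co-occurrence counts as a stand-in for learned embeddings (this is not word2vec, just the same "shared context" idea in its simplest form). The corpus and the helper names context_vectors and cosine are invented for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Represent each word by counts of the words seen within `window` of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


corpus = [
    "the reservation is confirmed",
    "the reservation is cancelled",
    "the cat sat on the mat",
]
vecs = context_vectors(corpus)
sim_opposites = cosine(vecs["confirmed"], vecs["cancelled"])
sim_unrelated = cosine(vecs["confirmed"], vecs["mat"])
print(sim_opposites, sim_unrelated)
```

Here confirmed and cancelled share exactly the same neighbours, so they come out as near-identical vectors, while confirmed and mat share none. Nothing in the context tells the model the first pair are opposites.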