by rolisz 1022 days ago
There are embeddings that are trained to reflect similarity, for example SentenceBERT, where the training process pushes pairs of similar sentences (as defined by whoever built the dataset) to have closer embeddings and dissimilar sentences to be further apart.
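As a rough illustration of that kind of training pressure, here is a minimal sketch of a cosine-based pair loss. It is an assumed, simplified objective for illustration, not necessarily what SentenceBERT itself uses. The names `pair_loss`, `anchor`, `near` and `far` are made up for this example, and the 2-dimensional vectors stand in for real sentence embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(u, v, label):
    """Squared error between the cosine similarity and a target label.

    label is 1.0 for a pair the dataset marks as similar and 0.0 for a
    dissimilar pair (an assumed labeling scheme for this sketch).
    """
    return (cosine_similarity(u, v) - label) ** 2

# Toy "embeddings" (hypothetical values).
anchor = [1.0, 0.0]
near = [0.9, 0.1]
far = [0.0, 1.0]

loss_similar_near = pair_loss(anchor, near, 1.0)   # similar pair, already close
loss_similar_far = pair_loss(anchor, far, 1.0)     # similar pair, far apart
loss_dissimilar_far = pair_loss(anchor, far, 0.0)  # dissimilar pair, far apart

print(loss_similar_near, loss_similar_far, loss_dissimilar_far)
```

Minimizing a loss like this over many labeled pairs is what pulls "similar" sentences together and pushes "dissimilar" ones apart, with "similar" meaning whatever the dataset labels say.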

As the OP points out, cosine similarity doesn't always equate to relevance. As I was expanding on, things get really messy as the number of dimensions increases, and your intuition about how vectors relate to one another goes out the window, fast. Probability mass is not distributed uniformly. Vectors become increasingly close to orthogonal to one another. And of course, there is no guarantee that latent dimensions align with human-meaningful semantic features: there's no pressure to align basis vectors with human-perceived semantics. My argument isn't that there is no similarity pressure; it's that similarity in high dimensions means something different than similarity in low dimensions. For example, in high dimensions most of the mass of a unit cube centered at the origin lies outside the unit sphere, while in 2 or 3 dimensions that cube is contained entirely inside the sphere with room to spare. High dimensions are weird, and that's what my comment is about, because many people are applying their low-dimensional intuition to ML.
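The cube-versus-sphere claim is easy to check numerically. The sketch below is a Monte Carlo estimate, assuming the cube is [-0.5, 0.5]^d centered at the origin and the sphere has radius 1. The function names, sample counts and seeds are arbitrary choices for this illustration. It also prints the average absolute cosine similarity between random Gaussian vectors, an added illustration of how random directions drift toward orthogonality as the dimension grows.

```python
import math
import random

def fraction_of_cube_inside_ball(dim, samples=20000, seed=0):
    """Estimate the share of the cube [-0.5, 0.5]^dim that lies within
    distance 1 of the origin, by uniform sampling inside the cube."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        sq_norm = sum(rng.uniform(-0.5, 0.5) ** 2 for _ in range(dim))
        if sq_norm <= 1.0:
            inside += 1
    return inside / samples

def mean_abs_cosine(dim, pairs=2000, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        total += abs(dot / (norm_u * norm_v))
    return total / pairs

for dim in (2, 3, 10, 30, 100):
    print(dim, fraction_of_cube_inside_ball(dim), mean_abs_cosine(dim))
```

The farthest corner of this cube sits at distance sqrt(d)/2 from the origin, so for d = 2 and d = 3 every sample lands inside the sphere, the corners touch it at d = 4, and beyond that the corners poke out and take a growing share of the volume with them.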