by RobinL 712 days ago
One useful technique here could be to use text embeddings and cosine similarity: https://simonwillison.net/2023/Oct/23/embeddings/

Love this. I have been using TF-IDF for embeddings, along with various similarity measures, for some personal pet projects. One thing I came across in my research is that cosine similarity is more useful for vectors of different lengths and Euclidean distance for vectors of similar length, but Simon alludes to a same-length requirement. I'm not formally trained in this area, so I was hoping someone could shed some light on this for me.
You can use cosine similarity with embedding vectors of different lengths (or, more precisely, the vectors all have the same length, but they are sparse, with most components being 0).
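To make that concrete, here is a minimal sketch, assuming TF-IDF-style vectors stored as {term: weight} dicts, where any term missing from a dict is an implicit 0. The function names and the toy documents are made up for illustration. It also shows why cosine similarity ignores magnitude while Euclidean distance does not.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Missing terms are implicit zeros, so only shared terms add to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean distance over the union of terms, treating missing terms as 0."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical toy documents: different numbers of non-zero entries,
# but conceptually the same dimensionality (one slot per vocabulary term).
short_doc = {"cat": 1.0, "sat": 1.0}
long_doc = {"cat": 3.0, "sat": 3.0}  # same direction, three times the magnitude
other_doc = {"cat": 1.0, "mat": 2.0, "dog": 1.0}

print(cosine_similarity(short_doc, long_doc))   # very close to 1: same direction
print(euclidean_distance(short_doc, long_doc))  # clearly non-zero: magnitudes differ
print(cosine_similarity(short_doc, other_doc))  # well below 1: little term overlap
```

Dense embeddings from a model are the same idea without the zeros: every vector has the same number of components, and cosine similarity compares direction only.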