by OutOfHere 651 days ago
For those who don't want to use a full-blown RAG database, scipy.spatial.distance has a convenient cosine distance function. And for those who don't even want to use SciPy, there's the formula in the linked post.

For anyone new to the topic, note that cosine distance and cosine similarity run in opposite directions: distance is 1 minus similarity, so a smaller distance means a closer match, whereas a larger similarity does.
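
For readers who want the no-SciPy route, here is a minimal sketch in plain Python. It uses the standard textbook formula, which I am assuming matches the one in the linked post; the function names and sample vectors are my own, purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 minus cosine similarity; lower means more similar."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical vectors, chosen only to show the two orderings.
query = [1.0, 0.0]
near = [2.0, 0.0]  # same direction as query
far = [0.0, 3.0]   # orthogonal to query

sim_near, sim_far = cosine_similarity(query, near), cosine_similarity(query, far)
dist_near, dist_far = cosine_distance(query, near), cosine_distance(query, far)
```

Ranking by similarity means sorting descending, ranking by distance means sorting ascending; both put `near` ahead of `far`.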


The SciPy distance module has its own problems. It's pretty slow, and it constantly overflows in mixed-precision scenarios. It also raises the wrong type of error when it overflows, and it uses the general-purpose `math` package instead of `numpy` for square roots. So use it with caution.

I've outlined some of the related issues here: https://github.com/ashvardanian/SimSIMD#cosine-similarity-re...
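
As a rough illustration of how low-precision inputs can overflow, here is a sketch using NumPy alone. It does not reproduce SciPy's internals or the specific issues in the link above; the array values are made up, and it only shows the general mechanism of a half-precision dot product exceeding the float16 range.

```python
import numpy as np

# Hypothetical half-precision vector; float16 tops out at 65504.
a = np.array([300.0, 300.0], dtype=np.float16)

# 300*300 + 300*300 = 180000, which does not fit in float16.
with np.errstate(over="ignore"):
    dot_half = np.dot(a, a)

# Upcasting before the reduction keeps the result finite.
a32 = a.astype(np.float32)
dot_single = np.dot(a32, a32)

print(dot_half, dot_single)
```

If the inputs are float16, upcasting to float32 or float64 before computing norms and dot products is one way to sidestep this.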

Noted, and thanks for your great work. My experience with it is limited to working with LLM embeddings, whose values I believe have fallen cleanly between 0 and 1. As such, I have yet to encounter these issues.

Regarding the speed, yes, I wouldn't use it with big data. Up to a few thousand items has been fine for me, or perhaps a few hundred if pairwise.