Hacker News new | ask | show | jobs
by malux85 977 days ago
Generating the embeddings is by far the slowest part of that, but it’s embarrassingly parallel so if you have $ it can be done that quick.

When I worked for Dubai airport, I was tasked with building a vector similarity search that was highly optimised for query speed, in the end I ended up holding the vectors in memory (in a numpy array) and using scipy to do cosine similarity, I could get about 1.2 million vectors per second per core after tweaking and optimising, again this is embarrassingly parallel so if you have more vectors, chunk it to fit your hardware and you should get more or less linear scale with that per core.

If you want a hand writing this let me know.

(Also there’s a lot of caveats here, for example they did not need to update the vectors, it was an extremely read-heavy usage pattern)

1 comments

Actually if you set it up right it doesn't cost you anything more to do it if it's almost fully parallel. It doesn't matter if you paid for one GPU instance for 500 hours or 50 for 10 hours. The cost is about the same. You can also more confidently use spot pricing to reduce the cost.