by formercoder
875 days ago
When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
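A minimal sketch of that approach, for concreteness. The random embeddings, the column names, the 384 dimension and the row count are all placeholders made up for illustration; a real prototype would fill the `embedding` column from whatever embedding model it uses.

```python
import numpy as np
import pandas as pd

# Placeholder data: 1,000 random 384-dim vectors standing in for real embeddings.
rng = np.random.default_rng(0)
n_rows, dim = 1000, 384
embeddings = rng.normal(size=(n_rows, dim))

df = pd.DataFrame({
    "text": [f"chunk {i}" for i in range(n_rows)],
    "embedding": list(embeddings),  # one 1-D array per row
})

# Hypothetical query: reuse row 42's vector so the expected best match is known.
query = embeddings[42].copy()

def cosine_distance(a, b):
    # The "one line" distance function: 1 - cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Brute-force scan: one distance per row, then keep the closest five.
df["distance"] = df["embedding"].apply(lambda e: cosine_distance(e, query))
top = df.nsmallest(5, "distance")
print(top[["text", "distance"]])
```

For larger tables, stacking the column into one matrix and doing a single matrix-vector product would likely be faster than `apply()`, but at this size the simple version is fine.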
Here's some back-of-the-envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's about 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.
What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768 FLOPs.
So 4 trillion FLOPs for the embedding vs 768 for the cosine similarity. That's a factor of about 5 billion.
So you could brute-force a billion embeddings or more before the lookup became more expensive than generating the embedding.
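The same arithmetic as a quick script, in case anyone wants to plug in their own numbers. It assumes "2K tokens" means 2,000 and uses the usual rule of thumb of 2 FLOPs per parameter per token; the variable names are just for illustration. The ratio comes out to roughly 5 billion, so "billions" holds.

```python
# Cost of producing one embedding with a 1B-parameter model.
params = 1_000_000_000
flops_per_token = 2 * params          # rule of thumb: ~2 FLOPs per parameter per token
chunk_tokens = 2_000                  # assumed meaning of "2K tokens"
embed_flops = flops_per_token * chunk_tokens

# Cost of one dot product between two 384-dim vectors.
dim = 384
cosine_flops = 2 * dim                # one multiply and one add per dimension

ratio = embed_flops / cosine_flops

print(f"embedding: {embed_flops:.1e} FLOPs")
print(f"dot product: {cosine_flops} FLOPs")
print(f"ratio: {ratio:.2e}")
```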
What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.
The time needed to look up an embedding among millions of embeddings using an approximate nearest neighbors algorithm is measured in milliseconds.
The game changed when we switched from word2vec to LLMs to generate embeddings.
A difference of a billion times or more is big enough to break the assumptions earlier systems were designed under.