by rvnx 638 days ago
In short: no.

Vector databases are there to store vectors and to calculate distances between them.
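To make "calculating distance" concrete, here is a rough sketch of the two distance measures you see most often. The function names are mine, not from any particular database, and real engines do this in optimized native code rather than plain Python.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Which measure fits best depends on how the embedding model was trained, so treat the choice here as an assumption.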

The embedding model is the model you pick to generate these vectors from a string or an image.

You give "bart simpson" to an embedding model and it becomes (43, -23, 2, 3, 4, 843, 34, 230, 324, 234, ...)

You can picture these as points in space (strictly speaking they are vectors), except that instead of 2D or 3D space they typically live in a much higher number of dimensions (e.g. 768).

When you want to find similar entries, you generate a new vector for the query, e.g. "homer simpson" becomes (64, -13, 2, 3, 4, 843, 34, 230, 324, 234, ...), send it to the vector database, and it returns the nearest neighbors (= the existing entries with the smallest distance).
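As a rough illustration of that lookup, here is a brute-force nearest-neighbor search in plain Python. The vectors are made-up 4-dimensional toys (a real model gives you hundreds of dimensions), and `nearest_neighbors` is a hypothetical helper, not any database's API.

```python
import heapq
import math

# Toy stored entries; the numbers are invented for illustration only.
stored = {
    "bart simpson": (43.0, -23.0, 2.0, 3.0),
    "lisa simpson": (40.0, -20.0, 5.0, 1.0),
    "quarterly tax report": (-300.0, 512.0, 77.0, -90.0),
}

def nearest_neighbors(query, entries, k=2):
    """Return the k entry names whose vectors are closest to the query."""
    return heapq.nsmallest(
        k, entries, key=lambda name: math.dist(query, entries[name])
    )

homer = (64.0, -13.0, 2.0, 3.0)  # pretend embedding of "homer simpson"
result = nearest_neighbors(homer, stored, k=2)
print(result)  # ['bart simpson', 'lisa simpson']
```

This scans every entry, which is fine for a toy. Actual vector databases usually avoid the full scan with approximate nearest-neighbor indexes.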

You can use any model you want to generate these vectors; however, you have to stay consistent.

That means once you have picked an embedding model, you are "forever" stuck with it, as there is no practical way to project vectors from one model's space into another's.
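One way to enforce that consistency is to pin the store to a single model and reject anything else. This is only a sketch: `VectorStore` and the model names are hypothetical, and checking the dimension count catches only the obvious mismatches, since two different models can share the same number of dimensions.

```python
class VectorStore:
    """Toy in-memory store pinned to one embedding model."""

    def __init__(self, model_name, dimensions):
        self.model_name = model_name
        self.dimensions = dimensions
        self.entries = {}

    def add(self, key, vector, model_name):
        # Refuse vectors produced by a different model: their coordinates
        # are not comparable with the ones already stored.
        if model_name != self.model_name:
            raise ValueError(
                f"store uses {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )
        self.entries[key] = tuple(vector)


store = VectorStore("model-a", dimensions=3)  # "model-a" is a placeholder name
store.add("bart simpson", [43, -23, 2], model_name="model-a")
```

If you do switch models, the usual route is to keep the original text or images around and re-embed all of it with the new model.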

1 comments

That sucks :(. I wonder if there are other approaches to this, like simple word lookup with a few stored synonyms, and prompting the LLM to always use the proper technical terms when performing a lookup.
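Something like this is what I have in mind, as a very rough sketch. The synonym table, the document ids, and the `lookup` function are all invented for illustration.

```python
# Hypothetical synonym table: each variant maps to one canonical term.
SYNONYMS = {
    "k8s": "kubernetes",
    "kube": "kubernetes",
    "postgres": "postgresql",
    "pg": "postgresql",
}

# Hypothetical lookup table from canonical term to document ids.
DOCUMENTS = {
    "kubernetes": ["doc-12", "doc-40"],
    "postgresql": ["doc-7"],
}

def lookup(term):
    """Normalize the term, resolve synonyms, then do an exact lookup."""
    normalized = term.strip().lower()
    canonical = SYNONYMS.get(normalized, normalized)
    return DOCUMENTS.get(canonical, [])

print(lookup("K8s"))  # ['doc-12', 'doc-40']
```

The obvious weakness is that any term missing from the table returns nothing, which is why the LLM would need to be prompted to stick to the canonical terms.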
Back-of-the-book indexes or inverted indexes can be stored in a set store and give decent results, comparable to vector lookups. The catch is that you have to run an extraction inference to get the keywords.
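A minimal sketch of an inverted index, assuming plain Python sets stand in for the set store. The `extract_keywords` function here is just a crude tokenizer with a stopword list; it is a placeholder for the extraction inference step, not a substitute for it.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def extract_keywords(text):
    """Placeholder extraction: lowercase words minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def build_index(docs):
    """Map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for keyword in extract_keywords(text):
            index[keyword].add(doc_id)
    return index

def search(index, query):
    """Rank document ids by how many query keywords they match."""
    scores = defaultdict(int)
    for keyword in extract_keywords(query):
        for doc_id in index.get(keyword, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "doc-1": "Homer Simpson works at the nuclear power plant",
    "doc-2": "Bart Simpson is in the fourth grade",
    "doc-3": "Quarterly tax report for the power company",
}
index = build_index(docs)
print(search(index, "simpson power plant"))  # ['doc-1', 'doc-2', 'doc-3']
```

Exact keyword matching like this misses paraphrases, which is where the quality of the extraction step matters.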