by miki123211 798 days ago
A vector DB is the complete opposite of what you describe: it maps list<double> to pair<file, string>.

The queries it's good at are not "what vectors map to this filename", but "what pieces of text are closest to this vector, and what metadata do we have about them?" This is a non-trivial problem to solve if you don't want your queries to be O(n) where n is the dataset size.
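For reference, the O(n) baseline mentioned above looks roughly like the sketch below: every stored vector is compared against the query, so cost grows linearly with dataset size. The record layout (vector, file, text), the function names, and the toy 3-dimensional vectors are assumptions made for illustration, not any particular product's API.

```python
import heapq
import math
from typing import NamedTuple


class Record(NamedTuple):
    vector: list[float]  # the embedding
    file: str            # metadata: where the text came from
    text: str            # the piece of text itself


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest(records: list[Record], query: list[float], k: int = 1) -> list[Record]:
    # Linear scan: one similarity computation per stored record, hence O(n).
    return heapq.nlargest(
        k, records, key=lambda r: cosine_similarity(r.vector, query)
    )


# Toy data (made up): three records with hand-written vectors.
records = [
    Record([1.0, 0.0, 0.0], "a.txt", "about cats"),
    Record([0.0, 1.0, 0.0], "b.txt", "about alloys"),
    Record([0.0, 0.9, 0.1], "c.txt", "about metals"),
]
top = nearest(records, [0.0, 1.0, 0.0], k=2)
```

Dedicated vector indexes exist to avoid this full scan, usually by accepting approximate rather than exact nearest neighbors.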

This is useful because AI models can transform any kind of content (usually text or images) into vectors, in such a way that content similar in meaning is transformed to vectors that are close to each other. This can be used, e.g., to find all documents related to your search query even if your search keywords are never directly mentioned, to find articles similar to the one you're currently reading, to search images by their descriptions, or even to see how closely a user submission matches "undesirable" content, like spam or porn.

I agree that specialized vector databases are a little silly though, considering that Postgres and others have vector extensions now.

2 comments

Specialized vector databases perform well on pure vector tasks but poorly when it comes to SQL compatibility and integration with existing systems. Traditional databases with a vector algorithm or vector plug-in, like ES, PG, and Redis, do provide vector functionality, and their advantage is that they are very easy to put to work in a production environment; but when the data scale is relatively large, they quickly run into performance bottlenecks.

There is a new type of vector database that combines the best of both worlds: MyScale, the SQL vector database. You can refer to the following blog posts to see the comparison. Our comprehensive benchmark evaluation reveals that MyScale exceeds other products by a long way in terms of filtered vector search accuracy, performance, cost-efficiency, and index build time. Importantly, MyScale is the only product tested that delivers healthy search accuracy and QPS across various filter ratios.

https://myscale.com/blog/myscale-outperform-specialized-vect... https://myscale.com/blog/myscale-vs-postgres-opensearch/

I know vector DBs x embeddings, so I'm afraid I'm just awful at communicating: to wit, and much to my consternation, I have to write and maintain code for both image and text embeddings, on 6 platforms.

I think we're getting to the heart of my confusion, and I only assume it's because of different use cases/expectations on privacy.

Let's say I'm CEO of Mousetrap Inc., and I've got this .txt file, our top secret plan for a better mousetrap.

I want genAI to pick out the parts about the new metal alloy.

I upload the file to B2BAI LLC, who turns it into List<String>, then we give it to the model and get back List<List<Float>>.

Vector DBs store the List<String> and the List<List<Float>> for retrieval.
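The shape of what gets stored can be sketched as follows. This is only an illustration: `chunk`, `toy_embed`, and the sample text are hypothetical, and `toy_embed` is a hashing stand-in for a real embedding model, so its vectors carry no actual meaning.

```python
import hashlib


def chunk(text: str, max_words: int = 8) -> list[str]:
    # Split the document into fixed-size word windows: the List<String>.
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def toy_embed(chunk_text: str, dims: int = 4) -> list[float]:
    # HYPOTHETICAL stand-in for an embedding model: counts words per hash bucket.
    vec = [0.0] * dims
    for word in chunk_text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    return vec


# Made-up sample document.
plan = (
    "Our better mousetrap uses a new metal alloy for the spring. "
    "The bait tray is unchanged from last year."
)
chunks = chunk(plan)                           # List<String>
embeddings = [toy_embed(c) for c in chunks]    # List<List<Float>>
stored = list(zip(chunks, embeddings))         # what the vector DB keeps
```

Note that `stored` holds the plaintext chunks right next to the vectors, so whoever hosts it holds the document text as well.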

I, the top secret mouse-trap inventor, do not want my plan stored on any 3rd party computer.

But this app I use puts it in an a16z-approved Vector DB™.

The vector DB provider now has the embeddings (List<List<Float>>) and the chunks (List<String>), which violates my desire to not have my top secret mousetrap plan stored at rest anywhere.

This is silly.

Big companies who are extremely protective of their secrets use the cloud. Even the US government isn't afraid to store classified information in AWS, and they're not joking around with secrecy.

Unless you're acting specifically against American interests, I can't imagine a situation in which a cloud company would actually steal your secrets.

If anything, I'd be afraid of a vector DB vendor getting hacked, but I don't think that most non-tech companies who want to use vector embeddings for their documents can provide better security themselves.

You're right, my threat model is the vector DB provider getting hacked, like you said.

It's not silly, because it takes one SWE-week, max, from start to finish, to just do it in memory locally. You don't need the Vector DB™.