| HN Mirror



	by shri_krishna 989 days ago
	The one-DB-fits-all approach only works when the database is really small and never grows. Imagine you have 100 customers. Each customer generates, on average, a million 1536-dimension vector embeddings (OpenAI Ada dimensions, the most popular right now). That is about 6GB (1536 dimensions x 4 bytes per f32 x 1,000,000 vectors) of just embeddings PER CUSTOMER. If you use HNSW, the index will take at least that much RAM, if not more. If you use PQ (and variants), you can reduce the size of the index in RAM to, say, 512MB-1GB per customer. That is still a lot of memory, and there is no way around it. Now imagine you are using that same database for transactions and other day-to-day business ops, which will also store millions of records but with small indexes. On its own, that workload would ideally have required only a single DB instance with a replica for redundancy. Once you add vectors to the equation, you have to needlessly scale this DB both horizontally and vertically just to maintain decent query/write performance (which would have been extremely fast without embeddings in the mix). You will eventually separate the embeddings out, as it makes no sense to scale the entire DB just for the sake of your embeddings. I am not even accounting for index generation for these vectors, which will require nearly 100% of all CPU cores while the index is being built (depending on the type of ANN you are using) and which in turn would slow your DB to a crawl.
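The memory arithmetic above can be checked with a short sketch. The function names are hypothetical, and the PQ figure assumes 512 sub-quantizers at 8 bits each (an illustrative choice that lands on the lower end of the 512MB-1GB range); codebooks, vector IDs, and HNSW graph links are not counted.

```python
def raw_embedding_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Size of the raw f32 vectors only, before any index overhead."""
    return num_vectors * dims * bytes_per_value


def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Size of the product-quantization codes only (codebooks and IDs excluded)."""
    return num_vectors * num_subquantizers * bits_per_code // 8


GB = 10**9

# One customer: 1,000,000 vectors x 1536 dimensions x 4 bytes.
per_customer = raw_embedding_bytes(1_000_000, 1536)

# 100 customers of the same size.
fleet = per_customer * 100

# Assumed PQ setting: 512 sub-quantizers, 8 bits per code.
pq_per_customer = pq_code_bytes(1_000_000, 512)

print(f"raw per customer: {per_customer / GB:.3f} GB")
print(f"raw for 100 customers: {fleet / GB:.1f} GB")
print(f"PQ codes per customer: {pq_per_customer / GB:.3f} GB")
```

Even with the compressed codes, 100 customers would still need tens of GB of RAM for vectors alone, which is the point the comment is making.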

2 comments

beoberha 989 days ago

Exactly - vector indexes are so different from traditional RDBMS B-Tree or LSM Tree indexes that it doesn’t make sense to use the same store for both unless it’s basically a toy app.

Someone makes the comparison in another comment, but it’s analogous to OLTP vs OLAP.


Foobar8568 989 days ago

I don't even want to imagine the workload on a high-transaction OLTP system mixed with OLAP access patterns. IMHO, if you can run both on the same system, you don't need OLAP in the first place.


trashtester 989 days ago

My experience is that if you do the data modelling properly, a well-designed star schema with some aggregation tables or materialized views on top can often remove the need for dedicated OLAP software.

Now you do NOT want to run such a setup on the same hardware that you use for your transactional systems, of course. But you CAN use the same software (like Oracle), which means that you do get some reduction in tech complexity.
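A minimal sketch of the star-schema-plus-aggregation-table idea, using Python's built-in `sqlite3` only so that it is self-contained. The table and column names are made up for illustration. SQLite has no materialized views, so a plain aggregate table stands in for one here and would need to be refreshed by the application; an engine like Oracle or Postgres can maintain or refresh a real materialized view instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: one row per customer.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        region      TEXT NOT NULL
    );

    -- Fact table: one row per sale, pointing at the dimension.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        sale_date   TEXT NOT NULL,
        amount      INTEGER NOT NULL
    );

    INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
    INSERT INTO fact_sales VALUES
        (1, 1, '2023-01-01', 100),
        (2, 1, '2023-01-02', 50),
        (3, 2, '2023-01-01', 70);

    -- Aggregation table: precomputed so reports never scan the fact table.
    CREATE TABLE agg_sales_by_region AS
        SELECT c.region,
               SUM(f.amount) AS total_amount,
               COUNT(*)      AS num_sales
        FROM fact_sales AS f
        JOIN dim_customer AS c USING (customer_id)
        GROUP BY c.region;
    """
)

# Reporting queries read the small precomputed table.
rows = conn.execute(
    "SELECT region, total_amount, num_sales FROM agg_sales_by_region ORDER BY region"
).fetchall()
print(rows)
```

The reporting side only touches `agg_sales_by_region`, which is what lets a general-purpose RDBMS cover many analytical queries without dedicated OLAP software.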


totetsu 989 days ago

Are there any DBs that could support both use cases while partitioning them so that the transactions etc. are kept only on the part of the resources they need? Basically two separated DBs, but sharing the same interfaces, security, etc.


shri_krishna 989 days ago

What you are talking about is possible in regular SQL DBs with extensions. However, when it comes to scaling, traditional DBs don't have the necessary tools to do so automatically. Most extensions provide support for the underlying ANN algorithm they implement and nothing more. Everything else you'll have to hand-roll yourself.

Clustering, load balancing, aggregating queries etc are quite different for a vector database in comparison to traditional OLTP databases.

It's the same as the difference between OLAP and OLTP. The two have underlying architectural differences that make them incompatible to run in an integrated fashion.

For instance, in a traditional DB the index is maintained and rebuilt alongside data storage, and for scaling you can separate it into read/write nodes. The write nodes typically focus on building indexes, while the read nodes serve queries against eventually consistent indexes (eventual consistency is achieved by broadcasting only the changed rows rather than sending the entire index).

Now it's similar in vector DBs too. You can separate the indexer from the query nodes (which access an eventually consistent index). However, the load is way higher than in a regular DB: the index is humongous and takes a long time to build, and sharing it with the query nodes is also more time-consuming and resource/network intensive, as you won't be sharing a few rows but the entire index itself. It requires a totally different strategy to get all query nodes to be eventually consistent.
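To put rough numbers on that difference, here is a back-of-the-envelope sketch. Every input is an assumption chosen for illustration: 10,000 changed rows of 200 bytes each on the traditional side, a 1,000,000 x 1536 f32 index on the vector side, and 3 query nodes. The function names are hypothetical, and real systems may ship compressed or incremental index segments instead of the full index.

```python
def row_delta_transfer(changed_rows: int, avg_row_bytes: int, query_nodes: int) -> int:
    """Bytes sent when only the changed rows are broadcast to every read node."""
    return changed_rows * avg_row_bytes * query_nodes


def full_index_transfer(num_vectors: int, dims: int, query_nodes: int,
                        bytes_per_value: int = 4) -> int:
    """Bytes sent when the whole (uncompressed) vector index goes to every query node."""
    return num_vectors * dims * bytes_per_value * query_nodes


# Traditional DB: broadcast 10,000 changed rows of ~200 bytes to 3 read nodes.
delta = row_delta_transfer(changed_rows=10_000, avg_row_bytes=200, query_nodes=3)

# Vector DB: ship a rebuilt 1M x 1536 f32 index to 3 query nodes.
full = full_index_transfer(num_vectors=1_000_000, dims=1536, query_nodes=3)

ratio = full // delta
print(f"row deltas: {delta / 10**6:.0f} MB, full index: {full / 10**9:.3f} GB, ratio: {ratio}x")
```

Under these assumed numbers the index sync moves thousands of times more data than the row broadcast, which is why it calls for a different consistency strategy.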

The only advantage of traditional DBs also implementing vector extensions is familiarity for the end user. If you are already familiar with postgres, you wouldn't want to leave your comfort zone. However, scaling a traditional DB is different from scaling a vector DB, and you'll encounter those pain points only in production and be forced to switch to a proper vector database anyway.


samlambert 988 days ago

We (PlanetScale) announced Vector storage and search today. If I am understanding your request, it sounds like something we could do. I would love to hear more if you are willing to chat: s@planetscale.com


redwood 989 days ago

Absolutely, this is par for the course for distributed engines, just not Postgres and other single-node engines, which a lot of people here will tell you are all you need. What they mean is that's all you need until you either have to introduce enormously complex application-tier sharding or move to a scalable engine.
