Hacker News
by vessenes 1045 days ago
I'm curious about your approach on where you draw the line for database features; I don't have a perspective on what's right, just trying to get informed.

There are a bunch of possible areas to circle or ignore when making an ML-capable database of some sort. In rough order of data complexity:

1. Embeddings (context-free vectors, just an ID and the vector)

2. Metadata + Embedding (source data, JSON)

3. Binary Data + Metadata + Embedding (add documents)
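
To make the three tiers concrete, here is a rough sketch of them as nested record types. The class and field names are hypothetical and only illustrate the list above; they are not any particular database's schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EmbeddingRecord:
    """Tier 1: a context-free vector, just an ID and the vector."""
    id: str
    vector: List[float]


@dataclass
class MetadataRecord(EmbeddingRecord):
    """Tier 2: adds JSON-like metadata about the source data."""
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DocumentRecord(MetadataRecord):
    """Tier 3: adds the raw binary document itself."""
    blob: bytes = b""


# Each tier is a superset of the previous one.
record = DocumentRecord(
    id="doc-1",
    vector=[0.1, 0.2, 0.3],
    metadata={"source": "example"},
    blob=b"raw document bytes",
)
```

Each step up the list adds one more thing the store has to persist, version, and keep in sync with the vector.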

Then there are tooling questions: in this matrix you'd want to decide if you're going to allow inference, and if so, will it be arbitrary, service-based, etc. against the documents, and if so, how will you store the results?

I'm curious how you're thinking about the design space. The embedding-only route is conceptually appealing because it's simple. In a larger engineering project, there's a tension between "where do I keep all this data," "how do I process and reprocess all this data", and "where do I keep the results of all the processing", and to me there aren't clear bright-line architectures that seem "best of".

Put another way, 15 years ago, we went memcached -> redis 1 -> redis (whatever it is now), and at the same time, we went mysql/postgres/oracle -> nosql json stores; today all of these have relatively well-defined use cases (and for most of them sqlite is the best choice, obviously).

How are you seeing the ML db scene playing out, and where do you think the sqlite of this space will land on architecture?

1 comments

Thank you for the insightful topic! Just reading the question made me think a lot.

From the database perspective, instead of dividing the table schema into 3 parts (id, metadata, embedding), we designed it to be closer to SQL: we treat vector as just another data type and let users define any number of fields in a table. ID is just an annotation on a field (composite keys might be overkill for now). There is a separate debate about whether schemaful or schemaless is the right approach; we can leave that aside for now.
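
As a rough illustration of "vector as another data type", here is a toy in-memory table where a vector is just one more field type and the ID is a flag on a field. Everything here (`Field`, `Table`, `nearest`, the `dtype` strings) is hypothetical and made up for this sketch; it is not our actual API, and the search is brute force rather than an index.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str                 # e.g. "int", "string", "vector"
    primary_key: bool = False  # the ID is an annotation on a field
    dim: int = 0               # only meaningful for vector fields


class Table:
    def __init__(self, name, fields):
        self.name = name
        self.fields = {f.name: f for f in fields}
        self.rows = []

    def insert(self, row):
        # Only vector dimensions are validated in this sketch.
        for f in self.fields.values():
            if f.dtype == "vector" and len(row[f.name]) != f.dim:
                raise ValueError(f"{f.name}: expected {f.dim} dimensions")
        self.rows.append(row)

    def nearest(self, field_name, query, k=1):
        """Brute-force cosine-similarity search over one vector field."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self.rows, key=lambda r: cosine(r[field_name], query), reverse=True)
        return ranked[:k]


# A table can carry any number of fields, including several vector fields.
docs = Table("docs", [
    Field("doc_id", "int", primary_key=True),
    Field("title", "string"),
    Field("title_vec", "vector", dim=3),
    Field("body_vec", "vector", dim=3),
])
docs.insert({"doc_id": 1, "title": "cats", "title_vec": [1.0, 0.0, 0.0], "body_vec": [0.9, 0.1, 0.0]})
docs.insert({"doc_id": 2, "title": "dogs", "title_vec": [0.0, 1.0, 0.0], "body_vec": [0.1, 0.9, 0.0]})

top = docs.nearest("title_vec", [0.9, 0.1, 0.0], k=1)
```

The point of the shape is that metadata stops being a special JSON bag: it is ordinary typed fields sitting next to one or more vector fields.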

With this foundation, we already cover 1 and 2. On our roadmap we also plan to cover 3, with multi-modal data type support. We think the real big advantage of embeddings is with unstructured data (documents, images, video, audio, etc.), and storing the embeddings of multi-modal data and connecting them through semantic relevance will open up big opportunities. This fits with the table-and-fields-based design, which lets us introduce cross-table embedding indexes to connect differently shaped data.

From the multi-modal data perspective comes the question: where do we store that data? One way is to provide a generic binary data type that lets users put in anything. Another way, which most enterprises will choose, is to integrate us with a larger data warehouse/data lake system. This creates a requirement for us to support streaming data in and out with a Kafka connector, Spark connector, etc.

And totally agree that SQLite works so well in a huge number of scenarios, and now there is DuckDB. We also see other players like LanceDB taking this approach to become the SQLite of the vector DB space. We are also pretty close to announcing our Python in-process package support, so Docker or a separate server is no longer a must-have.

Inference is a broader direction for us for now. We are open to exploring this space and seeing whether a serverless architecture in the cloud can provide extra efficiency benefits to the market.

Thanks for the thoughts. I agree that you're not going to disintermediate existing datalakes, no matter how successful, so integration makes sense.

Every few months I run into a use case where I'm like "I want to get a whole bunch of data, analyze it, then search for it later with embeddings, and probably keep running different sorts of analysis on it, and store the embeddings of those analyses in a related way." This still feels fairly difficult to do, or at least there aren't canonical "right" architectures yet.
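
For what it's worth, here is a minimal sketch of that workflow using only Python's built-in `sqlite3`. The table layout, the helper names, and storing vectors as JSON text are all assumptions made for illustration, and the search is a brute-force dot product; it is one possible shape, not a canonical architecture.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, body BLOB, metadata TEXT);
CREATE TABLE analyses (id INTEGER PRIMARY KEY,
                       document_id INTEGER REFERENCES documents(id),
                       kind TEXT, result TEXT);
-- One embeddings table for both raw documents and analysis results;
-- analysis_id is NULL when the vector describes the document itself.
CREATE TABLE embeddings (id INTEGER PRIMARY KEY,
                         document_id INTEGER REFERENCES documents(id),
                         analysis_id INTEGER REFERENCES analyses(id),
                         vector TEXT);
""")


def add_document(body, metadata, vector):
    cur = conn.execute(
        "INSERT INTO documents (body, metadata) VALUES (?, ?)",
        (body, json.dumps(metadata)),
    )
    doc_id = cur.lastrowid
    conn.execute(
        "INSERT INTO embeddings (document_id, analysis_id, vector) VALUES (?, NULL, ?)",
        (doc_id, json.dumps(vector)),
    )
    return doc_id


def add_analysis(doc_id, kind, result, vector):
    cur = conn.execute(
        "INSERT INTO analyses (document_id, kind, result) VALUES (?, ?, ?)",
        (doc_id, kind, result),
    )
    analysis_id = cur.lastrowid
    conn.execute(
        "INSERT INTO embeddings (document_id, analysis_id, vector) VALUES (?, ?, ?)",
        (doc_id, analysis_id, json.dumps(vector)),
    )
    return analysis_id


def search(query, k=3):
    """Brute-force dot-product search across document and analysis vectors."""
    rows = conn.execute(
        "SELECT document_id, analysis_id, vector FROM embeddings"
    ).fetchall()
    scored = [
        (sum(q * v for q, v in zip(query, json.loads(vec))), doc_id, analysis_id)
        for doc_id, analysis_id, vec in rows
    ]
    return sorted(scored, key=lambda t: t[0], reverse=True)[:k]


doc_id = add_document(b"raw bytes", {"source": "example"}, [1.0, 0.0])
summary_id = add_analysis(doc_id, "summary", "a short summary", [0.0, 1.0])
hits = search([0.1, 0.9], k=1)  # each hit is (score, document_id, analysis_id)
```

Every hit carries both a `document_id` and an optional `analysis_id`, so a match on a derived analysis can always be traced back to the source document, and new kinds of analysis can be added later without changing the schema.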

My instinct is if you nail the ml+dev+data ops needs with good architecture and api you could really have something -- good luck!