by songrenchu 1045 days ago
Thank you for the insightful topic! Just reading the question made me think a lot.

From the database perspective, instead of dividing the table schema into three parts (id, metadata, embedding), we designed it to be closer to SQL: we treat vector as just another data type and let users define any number of fields in a table. ID is just an annotation on a field (composite keys might be overkill for now). There is a separate debate on whether schemaful or schemaless is the right approach, but we can leave that aside for now.
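A minimal sketch of the "vector is just another field type" idea, in plain Python. This is an illustration only, not any real product's API: the `Field` and `Table` names, the `dtype` strings, and the schemaful validation in `insert` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Field:
    """One column in a table (hypothetical type, for illustration)."""
    name: str
    dtype: str                 # e.g. "int", "str", or "vector"
    dim: Optional[int] = None  # only meaningful when dtype == "vector"
    is_id: bool = False        # ID is an annotation on a field, not a separate part


@dataclass
class Table:
    """A table with any number of fields, including several vector fields."""
    name: str
    fields: List[Field]
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def insert(self, row: Dict[str, Any]) -> None:
        # Schemaful check (an assumption): row keys must match the declared fields.
        expected = {f.name for f in self.fields}
        if set(row) != expected:
            raise ValueError(f"row keys {sorted(row)} do not match schema {sorted(expected)}")
        # Vector fields must have the declared dimensionality.
        for f in self.fields:
            if f.dtype == "vector" and len(row[f.name]) != f.dim:
                raise ValueError(f"{f.name}: expected {f.dim} dimensions")
        self.rows.append(row)


# Two vector fields live next to ordinary scalar fields in one table.
docs = Table(
    name="docs",
    fields=[
        Field("doc_id", "int", is_id=True),
        Field("title", "str"),
        Field("title_vec", "vector", dim=3),
        Field("body_vec", "vector", dim=4),
    ],
)
docs.insert({
    "doc_id": 1,
    "title": "hello",
    "title_vec": [0.1, 0.2, 0.3],
    "body_vec": [0.0, 0.1, 0.2, 0.3],
})
```

The point of the sketch is that nothing limits a table to a single embedding column: each vector field carries its own dimension, and the ID is simply a flag on one of the fields.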

With this foundation, we already cover 1 and 2. Our roadmap also plans to cover 3, with multi-modal data type support. We think the real advantage of embeddings is on unstructured data (documents, images, video, audio, etc.), and storing the embeddings of multi-modal data and connecting them through semantic relevance will open up big opportunities. This fits the table-and-fields-based design, which allows introducing a cross-table embedding index that connects data of different shapes.
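As a rough sketch of what connecting differently shaped data through semantic relevance could look like, the snippet below ranks rows from several tables against one query vector. It assumes all tables embed into the same vector space, and it uses a brute-force scan with cosine similarity in place of a real index; `cosine` and `nearest_across` are hypothetical helper names, not part of any library.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_across(
    query: Sequence[float],
    tables: Dict[str, List[Tuple[str, Sequence[float]]]],
    k: int = 2,
) -> List[Tuple[str, str]]:
    """Return the k most similar (table_name, row_id) pairs across all tables."""
    scored = [
        (cosine(query, vec), table_name, row_id)
        for table_name, rows in tables.items()
        for row_id, vec in rows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(table_name, row_id) for _, table_name, row_id in scored[:k]]


# Toy data: a documents table and an images table sharing one embedding space.
tables = {
    "documents": [("d1", [1.0, 0.0]), ("d2", [0.0, 1.0])],
    "images": [("i1", [0.9, 0.1]), ("i2", [-1.0, 0.0])],
}
result = nearest_across([1.0, 0.0], tables, k=2)
```

Here `result` mixes rows from both tables, which is the behavior a cross-table embedding index would aim to provide efficiently rather than by scanning every row.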

The multi-modal data perspective raises the question of where to store that data. One way is for us to provide a generic binary data type that lets users put anything in. Another way, which most enterprises will choose, is to integrate us with a larger data warehouse or data lake system. That creates a requirement for us to support streaming data in and out via a Kafka connector, Spark connector, etc.

And I totally agree that SQLite works so well in a huge number of scenarios, and now there is DuckDB. We also see other players like LanceDB taking this approach, aiming to be the SQLite of the vector DB space. We are also pretty close to announcing our Python in-process package support, so Docker or a separate server is no longer a must-have.

As for inference, this is a broader direction for us for now. We are open to exploring this space and seeing whether a serverless architecture in the cloud can provide an extra efficiency benefit to the market.

1 comment

Thanks for the thoughts. I agree that you're not going to disintermediate existing data lakes, no matter how successful you are, so integration makes sense.

Every few months I run into a use case where I'm like "I want to get a whole bunch of data, analyze it, then search for it later with embeddings, and probably keep running different sorts of analysis on it, and store the embeddings of those analyses in a related way." This still feels fairly difficult to do, or at least there aren't canonical "right" architectures yet.

My instinct is that if you nail the ML + dev + data ops needs with a good architecture and API, you could really have something. Good luck!