Show HN: CozoDB, Hybrid Relational-Graph-Vector Database (docs.cozodb.org)
137 points by zh217 1158 days ago
Hi HN! We're thrilled to share CozoDB v0.6, a monumental update to our FOSS database, which already unifies relational and graph features. With the addition of vector search, CozoDB becomes an even better companion for LLMs like ChatGPT.

This release introduces vector search using HNSW indices within Datalog, enabling seamless integration with powerful features such as ad-hoc joins, recursive Datalog, and classical whole-graph algorithms. This update significantly broadens CozoDB's capabilities.

Check out the linked release note for an in-depth look at the new features, comparisons to other systems, and intriguing AI development possibilities. We'd love for you to take a look! I'll be here to answer any questions you might have.

Looking forward to your feedback!

17 comments

Link to the GitHub repository: https://github.com/cozodb/cozo

Written in Rust, under the MPL 2.0 license: https://github.com/cozodb/cozo/blob/main/LICENSE.txt

Wow this is awesome.

I like the notes about automated linking between notes in knowledge tools.

One of my ideas is to represent software architecture and system architecture as vector embeddings and transform the architecture dynamically.

Can scale from one box to internet levels of traffic with a Markov chain prompt

Edit: To clarify, I want to be capable of transforming existing software architecture with a prompt, that perhaps describes attributes of the system or describes capabilities.

> I like the notes about automated linking between notes in knowledge tools.

May I ask where you read about this? I am also interested.

How well does Cozo handle larger than RAM data? Does it only need to keep the query's answer set in memory?

By larger than RAM I mean the entire WikiData knowledge graph (~100GB), with something like 16GB of RAM.

Another question: any plans for supporting Parquet files with query pushdown? I honestly doubt Parquet's efficiency can be matched with RocksDB (but I'm happy to be proven wrong), and having to convert big datasets is always a pain...

For the parquet question: currently CozoDB is developed by a single developer (me), but I am starting to explore ways of expanding the development team. Certainly a lot more features will be added if that happens, and parquet support looks like a really useful one.
Any contact info for you to discuss contributing?
Yes, this is correct: only the query's answer set needs to be in memory. We are also working on streaming for the Rust API, in which case you don't even need to keep the whole set in memory for simple queries.

FYI, here is a not very rigorous performance and memory usage analysis (for a previous version without the vector search capability): https://docs.cozodb.org/en/latest/releases/v0.3.html

Thanks for the speedy response!

Cozo is looking like a top contender for my project so far :)

Would using Cozo make sense for a social network? Certainly would make modelling comments easier but how well does it handle many concurrent users?

Can you intermix different storage engines as well? So e.g. a user could have a personal storage using sqlite but also easily save to a rocksDB storage as well?

In regards to the timetravel capabilities, can this be leveraged to implement git-like features querying these historical points in time in the data?

Also, I'm curious about your thoughts on how secure data is within Cozo. Or, asked another way, how production-ready is Cozo? I know it's still early days, but could Cozo be used as the primary database in a product being delivered today?

Great work all around, really awesome to see!

Great questions!

- As can be seen at https://docs.cozodb.org/en/latest/releases/v0.3.html, about 200K QPS can be achieved for concurrent writes with 24 threads on a pretty old server. I think that is enough for a small to medium social network.

- You can start independent instances and use them together in your user code. You can have as many as you like, but data can only be exchanged through your code: they can't talk directly to each other.

- If by git-like you mean point-in-time queries, yes that's what the feature is for. But git comes with lots of other things such as merge logic, etc. These need to be implemented outside CozoDB.

- We do use CozoDB for data storage in production systems ourselves, and we back up a lot. So far nothing disastrous has happened. Note that CozoDB does not have any meaningful concept of user/authentication/authorization (yet), so you must make sure that only trusted clients can reach it (only an issue if you use the standalone server, since the embedded DBs do not open any ports).
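
To make the point-in-time answer above concrete, here is a minimal sketch in plain Python. It is not CozoDB's API: `History`, `record`, and `as_of` are hypothetical names, and the only assumption is that every write is kept together with its timestamp, so a query "as of" time t returns the latest value written at or before t. Git-style merging or branching would still have to be built on top of something like this.

```python
import bisect


class History:
    """Append-only history of one key's values, ordered by timestamp (hypothetical)."""

    def __init__(self):
        self._times = []   # sorted timestamps
        self._values = []  # _values[i] was written at _times[i]

    def record(self, timestamp, value):
        # Insert at the sorted position so out-of-order writes still work.
        i = bisect.bisect_right(self._times, timestamp)
        self._times.insert(i, timestamp)
        self._values.insert(i, value)

    def as_of(self, timestamp):
        # Latest value written at or before `timestamp`, or None if there is none.
        i = bisect.bisect_right(self._times, timestamp)
        return self._values[i - 1] if i else None


status = History()
status.record(10, "draft")
status.record(20, "published")
print(status.as_of(15))  # the value that was current at time 15
```

Nothing is ever overwritten here, which is what makes the historical lookup possible; the cost is that storage grows with every write.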

As a graph “fanboy” I’m impressed, humbled, and inspired by the work that’s been done already and the direction you’re heading!

> Note that CozoDB does not have any meaningful concept of user/authentication/authorization (yet)

Please please please implement the Palantir security model unless you already have a smarter idea coming down the pipe. Palantir regularly scrubs past media from the internet, but there is a blog post that has the ACL slides from the now-private video: https://onetwo.ren/级联GraphQL访问控制/

Did some digging, I found this: https://documents.pub/document/palantir-access-control.html which appears to be the full slideshow
Perfect yes, thank you.
Could this be used for creating a memory system, with weights, and the ability to rewind thought chains? Would you be interested in partnering up? I'm not a database dev, but I have some great ideas, and I'm already reaching out to investors to build something in AI, and potentially have a partner. I'd love to build a database that is basically like the midbrain of AI, a hybrid database built specifically for AI memories and memory relations. If you're open to collaborating and building a product, perhaps my ideas could be a good 'test case' and be mutually beneficial to all of us. email : patrickwcurl - gmail.
Amazing! Thank you, all very encouraging answers. Congrats on everything you've achieved with Cozo so far!!

One last question if possible. Is there a recommended way to do Full Text Search on data stored in Cozo?

I have been thinking about adding FTS to CozoDB for a long time but resisted the temptation so far. The reason is that text search is language-specific: what works for one language does not work for another. There is simply no way that CozoDB can duplicate the work of a dedicated text search engine for all the languages in the world.

Our current solution is to use mutation callbacks to synchronize texts to a dedicated text search engine. How you register a callback depends on the host programming language: for example, for Python: https://github.com/cozodb/pycozo#mutation-callbacks , and for Rust: https://docs.rs/cozo/latest/cozo/struct.Db.html#method.regis...
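
A rough sketch of the callback pattern described above, in plain Python. Nothing here is the pycozo or Rust API: `Store`, `register_callback`, and `TextIndex` are hypothetical stand-ins, and the "search engine" is just a whitespace-tokenized inverted index kept in memory. The point is only the shape of the idea: every mutation is forwarded to a second system that owns text search.

```python
from collections import defaultdict


class Store:
    """Toy key-value store that notifies listeners after every mutation (hypothetical)."""

    def __init__(self):
        self._rows = {}
        self._callbacks = []

    def register_callback(self, fn):
        self._callbacks.append(fn)

    def put(self, key, text):
        old = self._rows.get(key)
        self._rows[key] = text
        for fn in self._callbacks:
            fn("put", key, old, text)

    def remove(self, key):
        old = self._rows.pop(key, None)
        for fn in self._callbacks:
            fn("remove", key, old, None)


class TextIndex:
    """Stand-in for an external search engine: a naive inverted index."""

    def __init__(self):
        self._postings = defaultdict(set)

    def on_mutation(self, op, key, old_text, new_text):
        # Drop postings for the old text, then index the new text (if any).
        for token in (old_text or "").lower().split():
            self._postings[token].discard(key)
        for token in (new_text or "").lower().split():
            self._postings[token].add(key)

    def search(self, word):
        return sorted(self._postings.get(word.lower(), set()))


store = Store()
index = TextIndex()
store.register_callback(index.on_mutation)
store.put("n1", "Graph databases and Datalog")
store.put("n2", "Vector search with HNSW")
store.put("n1", "Relational databases")  # overwrite: old tokens are removed
print(index.search("databases"))
print(index.search("graph"))
```

The whitespace tokenizer is exactly the part that is language-specific in practice, which is why delegating it to a dedicated engine is attractive.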

Sonic [1] might be a good fit, though it is not yet factored into a separate library [2].

[1]: https://github.com/valeriansaliou/sonic

[2]: https://github.com/valeriansaliou/sonic/issues/150

Thank you, that makes sense. Plus, with vector search there seem to be ways of shoehorning FTS into it. One could also potentially use the SQLite storage and piggyback off SQLite FTS5, but I'm not sure how well that setup would work.
What about branching?
You’ve itemized almost my entire wish list.

In terms of “timetravel”, I want to see exactly what an item was at a specific time (COW with metadata works decently, but I’d love graph snapshots/diffs)

And one more thing: 20% Parity data for everything that’s in the system, stored in a way that it can be verified at-rest and can also be exported then verified locally.

Yes, I know filesystems are great at reliability now but safely transferring between systems is beyond their scope

Nice, I think you've just invented the first cellular sheaf db that I'm aware of
Can you explain?
As I understand it, a cellular sheaf complex assigns vector spaces to nodes and linear transformations to edges.
Wow, cellular sheaves, that's a connection I haven't thought of before!
Related:

Show HN: Cozo – new Graph DB with Datalog, embedded like SQLite - https://news.ycombinator.com/item?id=33518320 - Nov 2022 (67 comments)

Well this all looks incredibly cool, and certainly beyond my current understanding of graph and vector databases.

You mentioned in the release that you wrote your own knowledge management tool. Is that published somewhere?

Not for the moment; it is not polished enough. Right now it is just a web app written in React and ProseMirror running on top of a CozoDB instance. It is very rough around the edges (good enough for myself, but maybe not for others).

Once local LLMs that are powerful enough become available, though, I think I will try to find time to polish and publish it, since it can then act as a showcase for what a thinking agent can achieve.

How are you modeling the notes in CozoDB? I'd be interested in parsing my Obsidian notes into CozoDB as a way to set up alternate views on them. Curious how you thought about storing the bullets and the relations between them.
Wondering how it compares to rel¹.

¹- https://docs.relational.ai/rel/primer/basic-syntax

I'm not too familiar with rel, but from what I read, rel seems to be cloud only, and CozoDB always aims to be local-first. Another difference is that CozoDB has many whole-graph algorithms, and more can be added from user code (for example using Python), which I don't see rel having.
Apologies for kind-of sidetracking, but I went back and read the blog post about Cozo’s time travel feature and wanted to add this approach: since relationships are cheap, you can have a “current status” relationship and multiple “past and current status” relationships. When the user updates their status it replaces the “current status” while also adding itself as a “past and current status”.

That way querying for the current status is a 1:1 graph lookup and you can reserve timestamp lookups for querying past statuses.
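
A small sketch of that two-relationship idea in plain Python, with hypothetical names (`StatusGraph`, `update`, `status_at`) and dictionaries standing in for graph edges. It is not CozoDB code, only an illustration of keeping one "current" edge next to an append-only history.

```python
class StatusGraph:
    """One 'current status' edge per user plus an append-only history (hypothetical)."""

    def __init__(self):
        self.current = {}  # user -> status: the single "current status" edge
        self.history = {}  # user -> [(timestamp, status)]: "past and current status" edges

    def update(self, user, timestamp, status):
        # Replace the current edge and also append to the history.
        self.current[user] = status
        self.history.setdefault(user, []).append((timestamp, status))

    def current_status(self, user):
        # Direct lookup, no timestamp comparison needed.
        return self.current.get(user)

    def status_at(self, user, timestamp):
        # Scan the history for the latest entry at or before `timestamp`.
        best = None
        for ts, status in self.history.get(user, []):
            if ts <= timestamp and (best is None or ts >= best[0]):
                best = (ts, status)
        return best[1] if best else None


g = StatusGraph()
g.update("alice", 1, "away")
g.update("alice", 5, "online")
print(g.current_status("alice"))
print(g.status_at("alice", 3))
```

The trade-off is a second write on every update in exchange for a cheap read of the common case.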

Someone should try feeding those conceptual maps back to a multimodal agent as pictures.

Also, to what extent is this related to quantum categorial grammars for NLP?

If you want folks playing with integrating CozoDB and LLMs, it might be worth adding a CozoDB wrapper to Langchain :-)
Thanks for the suggestion--will surely do that!
Would love to have CozoDB be a part of LlamaIndex too! There are a bunch of integrations with existing vector DBs: https://github.com/jerryjliu/llama_index/tree/main/gpt_index...
I created this account to thank you for your work; this seems perfect for what I'm working on! I'm about to finish the tutorial, and it's very well done :)
I've come back to say that I'm truly in awe at what you're doing. Given your examples (both in the tutorial and other articles), I think we might be thinking very similar things, but you're way ahead. You're amazing!
Thanks! I'm really glad that you find CozoDB useful!
How do you handle different LLMs having different vector spaces?
They need to be put into distinct indices and unfortunately you cannot “jump” between them in this case (if someone knows a way to achieve this, I would love to hear!)
I think I’m understanding that the item’s vector in one LLM can be stored as one index and the vector in another LLM can be stored as a second index without them colliding or one having to overwrite the other.

Is that right?

Yes, actually I already do that. SBERT is better than OpenAI's ada embeddings for many use cases.
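
For anyone following along, here is roughly what "one index per embedding model" looks like as a plain Python sketch. Everything in it is an assumption for illustration: `MultiModelIndex`, the model names, and the tiny vectors are made up, and the search is brute-force cosine similarity rather than the HNSW index CozoDB uses. Queries only ever compare vectors from the same model, which is why the spaces never collide.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MultiModelIndex:
    """Keeps a separate vector index per embedding model (hypothetical)."""

    def __init__(self):
        self._indices = {}  # model name -> {item id -> vector}

    def add(self, model, item_id, vector):
        index = self._indices.setdefault(model, {})
        if index:
            dim = len(next(iter(index.values())))
            if len(vector) != dim:
                raise ValueError(f"{model} expects {dim}-dimensional vectors")
        index[item_id] = list(vector)

    def nearest(self, model, query, k=1):
        # Brute-force search within a single model's vector space.
        index = self._indices[model]
        ranked = sorted(index, key=lambda i: cosine(index[i], query), reverse=True)
        return ranked[:k]


idx = MultiModelIndex()
# The same two items, embedded by two made-up models with different dimensions.
idx.add("model_a", "doc1", [1.0, 0.0])
idx.add("model_a", "doc2", [0.0, 1.0])
idx.add("model_b", "doc1", [0.0, 0.0, 1.0])
idx.add("model_b", "doc2", [1.0, 0.0, 0.0])
print(idx.nearest("model_a", [0.9, 0.1]))
print(idx.nearest("model_b", [0.8, 0.1, 0.0]))
```

A query vector has to come from the same model as the index it is run against; there is no meaningful way to score a `model_a` query against `model_b` vectors.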
Amazing. Then as far as I’m concerned that functionally solves the problem until the industry figures out cross-embedding jumps
Is it called "CozoDB" or Cozo? Your README is inconsistent.
Sorry about that ... I will revise it to be more consistent. Cozo is a bit ambiguous, so now it is usually called CozoDB.
What is a vector database, and what makes it better for LLMs?
The linked article explains these in detail.
Does this allow commit/rollback on a graph DB?
Neat!
Thank you!