Hacker News
by SkyPuncher 94 days ago
Every time I look at graph databases, I just cannot figure out what problem they're solving. Particularly in an LLM-based world.

Don't get me wrong, graphs have interesting properties and there's something intriguing about these dynamic, open-ended queries. But what features/products/customer journeys are people building with a graph DB?

Every time I explore, I end up back at "yeah, but a standard DB will do 90% of this at 10% of the effort".

3 comments

In virtually all cases, you want a normal relational database and a sensible schema. Far easier and fewer sharp edges. Reaching for a graph database should never be the default choice.

A handful of data models have strongly graph-like characteristics where queries require recursive ad hoc joins and similar. If your data is small, this is nominally the use case for a graph database. Often you can make it work pretty well on a good relational database if you are an expert at (ab)using it. Relational databases usually have better features in other areas too.
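For a concrete picture of "recursive ad hoc joins" on a relational database, here is a minimal sketch of a bounded multi-hop traversal using a recursive CTE. It uses SQLite through Python's standard library only so that it runs anywhere; the `edges` table, the `reachable` helper, and the sample rows are assumptions made up for illustration, not anything from the thread.

```python
import sqlite3


def reachable(conn, start, max_hops):
    """Return {node: fewest hops} for nodes reachable from start within max_hops."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(node, hops) AS (
            SELECT ?, 0
            UNION
            SELECT e.dst, w.hops + 1
            FROM walk AS w
            JOIN edges AS e ON e.src = w.node
            WHERE w.hops < ?
        )
        SELECT node, MIN(hops) FROM walk GROUP BY node
        """,
        (start, max_hops),
    ).fetchall()
    return dict(rows)


# Hypothetical edge list stored as a plain two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("c", "a"), ("x", "y")],
)

# The hop bound plus UNION (which deduplicates rows) keeps the cycle a -> b -> c -> a from looping forever.
result = reachable(conn, "a", 3)
```

The same `WITH RECURSIVE` shape should carry over to Postgres with minor changes to parameter placeholders, though that is untested here.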

If you have a very large graph-like data model, then you have to consider more exotic solutions. You will know when you have one of these problems because you already tried everything and everything is terrible. But you still started with a relational database.

A standard DB à la Postgres will be a perfectly functional graph database unless you're doing very specialized network analysis queries, which is not what most of these "knowledge graph" databases are being used for. It's only querying and data modeling that's a bit fiddly (expressing the "graph" structure using SQL), and that's being improved by the new Property Graph Queries (SQL/PGQ) part of the latest SQL standard.
It'd be great if PG came with a serverless/embeddable mode, that'd be the main missing thing in comparison to this tool.

I know pglite, and while it's great someone made that, it's definitely not the same

I maintain a fork of pgserver (pglite with native code). It's called pgembed. Comes with many vector and BM25 extensions.

Just in case folks here were wondering if I'm some type of a graphdb bigot.

This is the same topic I had an intense argument about with my coworkers at the company formerly called FB a decade ago. There is a belief that most joins are 1-2 deep, and that many-hop queries with reasoning are rare or non-existent.

I wonder how you reconcile the demand for LLMs with multihop reasoning with the statement above.

I think a lot of what is stated here is how things work today and where established companies operate.

The contradictions in their positions are plain and simple.

There are worst-case optimal algorithms for multi-way and multi-hop joins. This does not require giving up the relational model.
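To make the idea less abstract, here is a rough sketch of the attribute-at-a-time ("generic join") strategy on the classic triangle query, over plain relations stored as lists of pairs. The function names and sample data are made up for illustration. The worst-case bound also depends on each intersection costing time proportional to the smaller input, which this sketch assumes hash-set intersection provides.

```python
from collections import defaultdict


def index(pairs):
    """Map each left value to the set of right values it is paired with."""
    idx = defaultdict(set)
    for left, right in pairs:
        idx[left].add(right)
    return idx


def triangles(R, S, T):
    """List (a, b, c) with (a, b) in R, (b, c) in S and (a, c) in T.

    Binds one attribute at a time by intersecting candidate sets, instead of
    joining two relations fully and filtering with the third afterwards.
    """
    r, s, t = index(R), index(S), index(T)
    s_keys = set(s)
    out = []
    for a in set(r) & set(t):        # a must start a pair in both R and T
        for b in r[a] & s_keys:      # b must follow a in R and start a pair in S
            for c in s[b] & t[a]:    # c must follow b in S and follow a in T
                out.append((a, b, c))
    return sorted(out)


# Hypothetical directed edge relation, used three times as a self-join.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
found = triangles(edges, edges, edges)
```

A pairwise plan would first materialize every two-hop path and only then check the closing edge; the intersection at each level is what avoids that intermediate blow-up.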
I maintain LadybugDB which implements WCOJ (inherited from the KuzuDB days). So I don't disagree with the idea. Just that it's a graph database with relational internals and some internal warts that make it hard to compose queries. Working on fixing them.

https://github.com/LadybugDB/ladybug/discussions/204#discuss...

Another important test is whether it's WCOJ on top of relational storage or whether the compressed sparse row (CSR) structure is actually persisted to disk. The PGQ implementations don't persist it.

There are second-order optimizations that LLMs logically implement that CSR-implementing DBs don't. With sufficient funding, we'll be able to pursue those as well.

CSR is an array-based trie, hence very costly to update. It can serve as an index for parts of the graph that basically will almost never change, but not otherwise.
Makes it a good match for columnar databases which already operate on the read-only, read-mostly part of the spectrum.
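For anyone who hasn't looked at the layout, here is a minimal in-memory sketch of CSR and of why a single edge insert is expensive. The helper names and sample graph are made up for illustration, and real engines store this on disk with more machinery than shown here.

```python
def build_csr(num_nodes, edges):
    """Build (offsets, targets) from (src, dst) pairs with integer node ids."""
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1            # count out-degree per node
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]     # prefix sum: start of each node's run
    targets = [0] * len(edges)
    cursor = offsets[:-1]                # slice copy: next free slot per node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(csr, node):
    """Out-neighbors are one contiguous slice, which is what makes reads cheap."""
    offsets, targets = csr
    return targets[offsets[node]:offsets[node + 1]]


def insert_edge(csr, src, dst):
    """Insert one edge in place: shifts the tail of targets and bumps every
    later offset, so the cost grows with the size of the whole graph."""
    offsets, targets = csr
    targets.insert(offsets[src + 1], dst)
    for i in range(src + 1, len(offsets)):
        offsets[i] += 1


# Hypothetical 3-node graph.
csr = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
before = neighbors(csr, 1)
insert_edge(csr, 1, 0)
after = neighbors(csr, 1)
```

Reads touch two offsets and one contiguous slice; a write touches everything after the insertion point, which is the update cost described above.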

Perhaps people can invent LSM-like structures on top of them.

But at least establish that CSR on disk is a basic requirement before you claim that you're a legit graph database.

That's coming to Postgres 19 this year. I had a brief exchange with a committer earlier this week, and it's actually available in the Postgres repo to try (you need to run your own build, of course). Very exciting development!
For starters, LLMs themselves are a graph database with probabilistic edge traversal.

Some apps want it to be deterministic.

I'm surprised this question comes up so often.

It's mainly from the vector embedding camp, who rightfully observe that vector + keyword search gets you to 70-80% on evals. What is all this hype about graphs for the last 20-30%?

"LLMs themselves are a graph database with probabilistic edge traversal" whaat?

Do you have any good demos to showcase where graph DBs clearly have an advantage? It's mostly just toy demos.

Vector embeddings, on the other hand, no matter how limited, have clearly proven themselves useful beyond YouTube/LinkedIn thought-leader demos.

It comes from people who develop LLMs. Anthropic and Google. References below.

My other favorite quote: transformers are GNNs which won the hardware lottery.

Longer form at blog.ladybugmem.ai

You want to believe that everything probabilistic has more value and determinism doesn't? Or that the world is made up of tabular data? You have a lot of company.

The other side of the argument I believe has a lot of money.

https://www.anthropic.com/research/mapping-mind-language-mod...

https://research.google/blog/patchscopes-a-unifying-framewor...

Not sure how that was the takeaway from both the posts above.

I read the blog post and your website, but unfortunately they didn't help change my perspective.

Thanks for the share