Hacker News new | ask | show | jobs
by derefr 1108 days ago
Graph databases are about as different from RDBMSes storage-wise, as column-stores are from row-stores. It comes down to how you plan to shard data and distribute queries when data doesn't fit on a single node.

Using a graph DB with many underlying KV-store nodes, you can have a single graph spread over many machines representing e.g. Facebook's social graph, and run a query which "chases around" edges between vertices that live on different nodes, to solve that query, while ensuring that as little of that has to happen as possible — both by rebalancing vertices so that data is sharded at low-connection-degree points in the graph; and by consolidating the steps of queries that occur on the same node into single batch queries, such that the whole thing becomes (close to) a single map/reduce step.

There's nothing in Postgres that knows how to do that; if you had e.g. a graph stored in a Citus hypertable, and did a recursive CTE over it to do graph search, then you'd get pretty dang bad perf.

4 comments

> It comes down to how you plan to shard data and distribute queries when data doesn't fit on a single node.

A problem everbody would love to have but pretty much nobody actually has.

> A problem everbody would love to have

Except the people who do have it and need to keep their business running off of one postgres instance.

You can have data that fits on one machine and still run multiple instances of postgres in a failover configuration, which will probably cover just about everyone (depending on your filesystem, disk for a single instance is essentially infinite, so I'm not actually sure what bottleneck would motivate you need to scale beyond this configuration).
> I'm not actually sure what bottleneck would motivate you need to scale beyond this configuration

It's usually not that data doesn't fit on one machine but that load on the database exceeds what one machine can serve. A failover configuration might enable you to use the spares for some read operations and take a little load of the primary, but you lose ACID semantics when you do that and it generally doesn't help you for long.

There is a reason the world moved on from failover architectures.

a) At some point you will have more data or more users than one instance can handle. And instead of simply adding another node you need to throttle usage in order to do a rolling upgrade. Which is far easier said than done and involves impact to the business.

b) With distributed databases you are constantly testing that everything works in Dev, Test etc environments. With failover you really only test it every now and again usually before you deploy to Production. And in most companies which are hopeless the testing will be guaranteed to be inadequate.

c) Vendors lie. They promise that failover will just work but in my experience it very often doesn't. Which is another reason why b) is such important to validate their claims.

I mean, I don't like that Postgres is not infinitely scalable, but the whole point is that (a) is not generally true--most companies could probably get by with a single machine's worth of data, or rather if they have more than one machine's worth of data, those systems probably aren't talking to each other such that they need to be on the same box. Regarding (b) and (c), do you not need to test failure conditions for distributed databases (this isn't rhetorical, I've only ever used cloud providers offerings)?
> most companies could probably get by with a single machine's worth of data, or rather if they have more than one machine's worth of data those systems probably aren't talking to each other such that they need to be on the same box

one of the things I would like to see in my lifetime is somehow it should be easier to "run the whole enterprise" from one box. Sure, it will probably be seriously underpowered and I can't do all things at once but for most small to mid-size companies, it should be possible to run all our "code" from one machine.

I think of this as some kind of development or pre-qa environment. It really shouldn't be that big of an ask...

I am thinking most, if not all, companies will be able to fit their entire enterprise on a Supermicro A1+ server with two 96-core processors. Sure, there is no machine in the world that can fit all of YouTube videos but there is no reason why we can't have YouTube, with a limited set of non-production data, running from just one box. Thoughts?

Interestingly, when a place does get to the point where the single instance has capacity issues (after upgrading to EPYC and lots of flash drives) then other non-obvious stuff shows up too.

For example, at one place just over a year ago they were well into this territory. One of weird problems for them was with pgBadger's memory usage (https://github.com/darold/pgbadger). That's written in perl, which doesn't seem to do garbage collection well. So even on a reporting node with a few hundred GB's of ram, it could take more than 24 hours to do a "monthly" reporting run to analyse PG usage for the time period.

There wasn't a solution in place at the time I left, so they're probably still having the issue... ;)

I consult for a lot of companies and I never heard of or seen a database that wasn't horizontally scaled.

It's not for scalability reasons it's for high-availability.

Which as cloud adoption has increased and server uptime has decreased is even more important.

Some of these arguments and “common knowledge” things are getting old. Everybody scaled up twenty year ago - hell Amazon used to brag that they used an HP Superdome or whatever.

Anyone with dogmatic opinions about this stuff need to be taken with a grain of salt. If you scale out PeopleSoft, your accounting system will exceed the value of your company. If you’re worried about webscaling your random app, that’s more wasting time navel gazing than accomplishing anything! :)

Why shard when you can just replicate?
Because replica failover is rarely seamless (and often doesn't actually work at all, IME).
Instinctively that's surprising... replica failover should be far simpler technically, shouldn't it?
No? Replication tends to be a bodged-on mess throughout, full of undertested edge cases, of which failover is definitely one. If you build the system so that nodes joining and leaving is a natural and normal part of operation, well, it naturally works a lot better.
The PGQ part of SQL is designed in such a way that an RDBMS can certainly optimize its storage of graph tables (eventually). ("Graph tables" used by PGQ have DDL setup, per the Eisentraut overview: CREATE PROPERTY GRAPH graphname…) If Postgres embraces PGQ it would be easy to imagine they would also adopt some graph database style storage to back its indexing and query planners.
> There's nothing in Postgres that knows how to do that

Postgres uses foreign data wrappers and table partitions to achieve sharding of relational databases over many nodes. It's certainly possible to make the FDW layer smarter wrt. being able to distribute certain queries to the shards, including some recursive queries that happen to be of practical interest.

if you are Facebook scale, yes. But then you anyway will use a dedicated graph database.

But a lot of graph databases have sizes where they are fully replicated on each node and some small internal optimizations for the graph are good enough. Given the design of the graph query language (and what you can query, hint: not everything arbitrarily) there is no reason postgres can't gain some additional optimizations to do that effectively as long as it's not facebook scale.

SQL now having syntax for querying graph means Graph databases can support SQL, at least a subset of it, in a standardized way.