Hacker News new | ask | show | jobs
by yummyfajitas 5585 days ago
Suppose you have a graph with edge weights.

Query: find all the nodes a distance < D from some reference node.

With SQL92 or earlier, this will require an unbounded number of queries. Proof: A SQL92 query can only traverse a finite number of edges, since it has only a finite number of joins (say K). Consider the graph 1->2->3->...->N with edge weights 1/N. This will require N/K queries to traverse. N is arbitrary.

It's probably doable in SQL99 and various vendor specific extensions (I think it's Turing Complete), but it is unlikely to be fast.

So, as is the normal situation, you can do whatever you want in SQL. But the performance is likely to suck.

(That said, I agree with you that many people are using NoSQL because it's the hip new thing, when they should just use SQL. "Schemaless flexibility" is BS - SQL is very flexible if you use an ORM + migration tool. The only flexibility you lose is the flexibility to screw up the data in application logic.)

1 comments

Trivial with SQL-standard recursive queries:

WITH RECURSIVE r AS (SELECT parent, child, distance FROM graph UNION ALL SELECT r.parent, graph.child, r.distance + graph.distance AS distance FROM graph, r WHERE graph.parent = r.child) SELECT parent, child FROM r WHERE distance < 3.14159;

Testing your graph with N=1000 on a teeny VPS. 477,897 rows returned in 6.1s.

If we move the WHERE clause inside the sub-SELECT (which unfortunately makes the query slightly more awkward to use), selecting lower distances reduces the compute time. For example, a bound of 0.5 returns 196,238 rows in 2.5s.

What sort of performance gains does a graph database give for this same query?

As I said, you can probably do it with SQL99 query, them being turing complete and all.

I don't know how your database is implementing things under the hood, so I don't know. I'll just list the major advantages of a graph database:

1) Edges stored with nodes. You can almost certainly read an edge and all it's neighbors with a single disk seek. This is unlikely to be the case for a SQL system.

2) Compute time is always O( portion of graph traversed), not O(graph size), and is always under the control of the user. In SQL, this will vary quite a bit with implementation.

3) You can specify various heuristics for the traversal - depth first, breadth first, etc. This is important if the specific nature of your data will affect the speed of the query.

Ah thank you, this answers my original question. In response:

1) I contest that this is "unlikely" for SQL -- MySQL (unless I'm misremembering) orders data on disk by its primary key, which would be an equivalent optimization. (PostgreSQL does not do this to my knowledge.)

2) In my two queries above, I demonstrate both O(graph) and O(traversed) (in that order). At least in PostgreSQL, this is very much under control of the user as well, since WITH queries are evaluated in a specific order.

3) There is no way to specify such things in SQL. However most SQLs are designed with the idea that the query planner is smarter than you (it almost always is, at least in PostgreSQL), so you shouldn't be trying to specify execution plans anyway.

Overall, I see no reason for graph databases & SQL to be separate. It seems that if graph DBs really only address performance issues as you claim, that one is throwing out the baby with the bathwater by creating a new kind of database rather than simply adding specialized optimizations to e.g. PostgreSQL.

1) Ordering data on disk by primary key is irrelevant. You look up the node - one seek. Look up the edges - another seek. Look up the nodes pointed to by the edges - several more seeks.

In fact, looking up the edges is multiple seeks. For an edge (A, B), if the primary key is the row itself, edges are only stored in sequence if you are starting at node A.

Graph DB's usually cheat by store edges on both nodes, if possible within the same page as the node.

2) I'll take your word for it.

3) I generally agree that the query planner is smarter, but I'd be very surprised if that is true for complicated recursive queries. In particular, I really doubt that a query planner can exploit the structure of your data to improve things.

I.e., I really doubt the planner will know whether a DFS or BFS is better, or come up with heuristics to select the optimal path to go down (this is heavily dependent on data). Graph databases tend to provide such functionality.

I absolutely agree with you that better graph support for a SQL DB would be ideal.