Hacker News new | ask | show | jobs
by jamesfisher 4166 days ago
I've been using Neo4j for around 3 months, but I'm unconvinced by the "graph" model as having any significant advantage over the relational model. A graph can be trivially modelled relationally as a "vertex" table and an "edge" table. A particular problem I have with the graph model is that edges cannot be referenced by other entries in the database; thus the schema designer must prefer using vertices to model all concepts, in case future data has to make reference to it.

The "graph model" was, in the 60s and 70s, called the "network model" [1]. This is the environment in which Codd introduced the relational model [2], where he wrote:

> In previous work there has been a strong tendency to treat the data in a data bank as consisting of two parts, one part consisting of entity descriptions (for example, descriptions of suppliers) and the other part consisting of relations between the various entities or types of entities (for example, the supply relation). This distinction is difficult to maintain when one may have foreign keys in any relation whatsoever. In the user’s relational model there appears to be no advantage to making such a distinction (there may be some advantage, however, when one applies relational concepts to machine representations of the user’s set of relationships).

It's quite funny that after the elegance of Codd's work, we're thinking about going back to the network model and all its problems. If we're going to consider that, we need very good reasons, but I haven't seen any.

I do see some features of Cypher as nice compared to SQL. Specifically those features that implement graph algorithms (e.g. shortest path). I do think there is space in the relational world for languages other than SQL. For example, it is disappointing that no relational databases implement Datalog as a query language. However, it is important not to conflate "SQL" and "the relational model".

[1]: https://en.wikipedia.org/wiki/Network_model [2]: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

3 comments

The advantage may just be the way you interface with the data, SQL vs Cypher/Gremlin(Groovy). I know with Gremlin one can easily walk the graph, write lambda functions to evaluate certain points while you traverse, create easy algorithms to do breadth/depth-first traversals, etc. all as a query. Those things just don't seem possible with SQL alone. Using the setup of a node and edge table, how could I write a query that would start with a node, find all edges of a type and go X levels deep, Y nodes wide, keep reference points that you can then go back to and restart the query from. Maybe write some sort of SQL function, but those things are hideous.

It would be interesting to see a script-type language on top of a traditional relational db.

The main reason is speed. Graph traversal is constant time as opposed to O(log n) for each traversal. The speed only becomes relevant later, and you can therefore add graph databases as a slightly delayed replica on top of a db for things you need to traverse fast.
This is true, however I think this again conflates two things: the relational model and the "tabular" data layout. I gather that databases like Neo4j obtain constant time traversal by storing each node's list of "relationships" at the same location as the node itself. We might be able to generalize this to a relational database, where every row in the database is stored together with the list of addresses of all other rows which reference it via a foreign key.
How? Describe a detailed solution.
PostgreSQL has an "array" data type (http://www.postgresql.org/docs/9.4/static/arrays.html), as do other databases. You could use that for that purpose (I know of research prototypes that have attempted this). It is unlikely to work well in general, especially given that degree distributions are often skewed (i.e., there is very high spread in the number of relationships per node) -- if I had the time, I would do a proper empirical comparison.
Couldn't this be improved with new data types and types of indexes in a relational db, similar to spatial types/indexes?
datalog is slow, even if performance was improved it still, not competitive with SQL. That said is really handy.
I'm told that Datalog is equivalent to SQL with recursive subqueries, which means that any Datalog query could be compiled to SQL. (Whether it's feasible to compile to efficient SQL, I don't know.)
I'm not aware of such a thing.

Really this is based on my readings and few experiments, I'm not 100% sure that datalog can not be made faster and socialite (see below) seem to claim I'm wrong.

Datalog is the native query language of datomic. I can put the finger on it, but I think there is a project for querying cassandra with datalog+clojure.

This is the most recent work I'm aware of:

- python querying of hadoop with a datalog-like language http://socialite-lang.github.io/

- overview: http://fr.slideshare.net/Hadoop_Summit/t-325p211seo

Also best way to learn datalog: http://www.learndatalogtoday.org/