by bjornsing 2008 days ago
How does this perform compared to a “native” graph database like Neo4J?

It really depends on what you want to do with it.

I would benchmark the tasks "traversal", "aggregation" and "shortest past" for a 10k to 10M node graph. Anything under 10k would be good enough with most techs, and over 10M you need to consider more tasks (writes, backup, and even the precise fields queried can become problems of their own at larger scale).

The GitHub link implements "traversal" in Python instead of pure SQLite. I suspect it will be around 10x slower than it could be with the same tech stack, because it queries once per node from Python to SQLite. Shortest path is not implemented and would be too slow to be useful in an interactive environment. "Aggregation" is also not implemented, but it would perform admirably, because SQL is good at that.
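
For illustration, here is a minimal sketch of what a traversal in pure SQLite could look like, using a single recursive query instead of one query per node. The `edges(source, target)` table and the `reachable` helper are assumptions made up for this example, not the project's actual schema or API.

```python
import sqlite3


def reachable(conn, start):
    """Return the set of node ids reachable from `start` with one recursive query."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(id) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT e.target
            FROM edges AS e
            JOIN reach AS r ON e.source = r.id
        )
        SELECT id FROM reach
        """,
        (start,),
    ).fetchall()
    return {row[0] for row in rows}


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")],
)

print(sorted(reachable(conn, "a")))
```

The whole traversal stays inside SQLite, so Python only pays for one round trip. Whether that yields the speedup guessed above would still need to be measured.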

Traditional relational OLTP databases such as Postgres are already faster than dedicated graph databases for certain graph related tasks, according to this benchmark: https://www.arangodb.com/2018/02/nosql-performance-benchmark...

> Traditional relational OLTP databases such as Postgres are already faster than dedicated graph databases for certain graph related tasks

It is indeed quite common that relational databases outperform graph databases on certain graph processing problems such as subgraph queries (a.k.a. graph pattern matching). There are two key reasons for this: (1) most graph pattern matching operations can be formulated using relational operations such as natural joins, antijoins, and outer joins; and (2) relational databases have been around longer and have well-optimized operators.
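
As a small sketch of point (1), a directed triangle pattern can be written as plain self-joins over an edge table. The `edges(source, target)` schema and the `triangles` helper are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical edge table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")],
)


def triangles(conn):
    """Find directed triangles x -> y -> z -> x using two self-joins."""
    return conn.execute(
        """
        SELECT e1.source, e2.source, e3.source
        FROM edges AS e1
        JOIN edges AS e2 ON e2.source = e1.target
        JOIN edges AS e3 ON e3.source = e2.target
                        AND e3.target = e1.source
        -- report each triangle once, starting from its smallest node id
        WHERE e1.source < e2.source AND e1.source < e3.source
        """
    ).fetchall()


print(triangles(conn))
```

Each edge in the pattern becomes one join, which is the kind of operation relational query optimizers have been tuned for over decades.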

A lot of the value that graph databases provide lies in their query languages which (for most systems) allow formulating path queries using a nice syntax (unlike SQL's WITH RECURSIVE which many people find difficult to read and write). Their property graph data model supports a schema-optional approach, which makes them better suited for storing semi-structured data. They also "provide efficient programmatic access to the graph, allowing one to write arbitrary algorithms against them if needed" [1].

With all that said, graph databases could be much faster on subgraph queries than relational databases and there are recent research results on the topic (worst-case optimal joins, A+ indexes, etc.). But these are not available in any production system yet.

[1] http://wp.sigmod.org/?p=1497

> "shortest past"

shortest path typo, right?

The Open Shortest Past First protocol is used to resolve temporal paradoxes.
Neo4j has failed queries I have written, with "out of memory" errors. I have never, ever, ever gotten that from SQLite.
Performance issues are a very valid discussion. But to me, the availability of a graph-oriented query language on top of this graph variant of SQLite is, imho, the very first step to investigate. (RDF import/CSV import being next)
There has been a lot of progress on creating standardized query languages for graphs. The two most notable ones are [2]:

- SQL/PGQ, a property graph query extension to SQL, is planned to be released next year as part of SQL:2021.

- GQL, a standalone graph query language, will follow later.

While it is a lot of work to design these languages, both graph database vendors (e.g. Neo4j, TigerGraph) and traditional RDBMS companies (e.g. Oracle [2], PostgreSQL/2ndQuadrant [3]) seem serious about them. And with a well-defined query language, it should be possible to build a SQL/PGQ engine in (or on top of) SQLite as well.

[1] https://www.linkedin.com/pulse/sql-now-gql-alastair-green/

[2] http://wiki.ldbcouncil.org/pages/viewpage.action?pageId=1062...

[3] https://www.linkedin.com/pulse/postgresql-oracle-graph-query...

Have SPARQL and Gremlin not seen adoption as standard graph traversal languages? They're the two names that spring to mind when I think "graph querying".
I second that. I have not followed the news about the Gremlin-to-SPARQL (or SPARQL-to-Cypher) bridge. But afaiu, making your graph system Gremlin-compatible is a first step in the right direction. (And yes, doing that on top of a SQL backend sounds not that natural).
Both SPARQL and Gremlin have been adopted to some extent. SPARQL is a W3C standard and Gremlin is reasonably well-specified (it has good documentation and a reference implementation), so it's possible to implement a functionally correct SPARQL/Gremlin engine with a reasonable development effort.

Gremlin's main focus is defining traversal operations on property graphs. While it supports pattern matching [1], IMHO its syntax is not as clean as Cypher's. Gremlin queries are also difficult to optimize: while it is possible to define traversal rewrite rules, they are more involved than relational optimization rules. The fact that most open-source Gremlin implementations focus on distributed setups (e.g. a typical deployment of Titan/JanusGraph runs on top of Cassandra) also has implications for single-machine performance, which certainly did not help the adoption of Gremlin -- but this is not necessarily a problem of the query language itself. Overall, Gremlin is great for workloads where complex single-source traversal operations do the bulk of the work but it's less well-suited to global pattern matching queries such as the ones in the LDBC Social Network Benchmark's BI workload [2].

SPARQL focuses on the graph problems of the "semantic web" domain, which include not only pattern matching but semantic reasoning/inferencing. One can use it for pattern matching queries but with the following caveats:

- Its data model is based on triples so if one wants to return a node and its attributes (properties), one has to specify each of these attributes explicitly.

- On the execution side, returning these attributes might necessitate executing a number of self-join operations.

- Many SPARQL implementations also have performance limitations due to the extra complexity introduced by self-joins, lack of intra-query parallelism, etc.
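
To make the self-join caveat concrete, here is a rough relational analogy (not how any particular SPARQL engine works): with data stored as subject/predicate/object triples, returning a node with three attributes takes one scan plus two self-joins. The `triples` table, its contents and the `person_rows` helper are made up for this sketch.

```python
import sqlite3

# Hypothetical triple table (subject, predicate, object), for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("alice", "name", "Alice"), ("alice", "age", "30"), ("alice", "city", "Oslo"),
        ("bob", "name", "Bob"), ("bob", "age", "25"), ("bob", "city", "Bergen"),
    ],
)


def person_rows(conn):
    """Return (id, name, age, city); every attribute is named explicitly and joined in."""
    return conn.execute(
        """
        SELECT n.s, n.o, a.o, c.o
        FROM triples AS n
        JOIN triples AS a ON a.s = n.s AND a.p = 'age'
        JOIN triples AS c ON c.s = n.s AND c.p = 'city'
        WHERE n.p = 'name'
        ORDER BY n.s
        """
    ).fetchall()


print(person_rows(conn))
```

In a property graph or a plain relational table, the same result would be a single row lookup per node; here each extra attribute adds another join.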

The "RDF* and SPARQL* approach" is an initiative to address the self-join problem by introducing nested triples in the data model. It's currently being worked on by a W3C community group [3]. SPARQL also has "property paths", which allow regular path queries, i.e. traversals where the node/edge labels conform to some regular expression (the "property" in "property paths" has nothing to do with "property graphs").

SQL/PGQ and GQL target the property graph data model and support an ASCII-art like syntax for pattern matching queries (inspired by Cypher). They also offer some graph traversal operations (including shortest path and regular path queries). Additionally, GQL supports returning graphs, so its queries can be composed.

[1] https://en.wikipedia.org/wiki/Gremlin_(query_language)#Decla...

[2] https://ldbc.github.io/ldbc_snb_docs/workload-bi-reads.pdf

[3] https://blog.liu.se/olafhartig/2019/01/10/position-statement...

Isn't it like asking "how does sqlite perform compared to databases like PostgreSQL" ?

SQLite is used a lot on the edge (mobile apps, ...), so it sounds like this project provides a graph database for the very same use case (I probably won't run Neo4J on mobile).

IMO it’s a different question. SQLite and Postgres are both relational databases, it stands to reason that they’re at least doing things in similar ways. They’re two implementations of the same idea(ish). A graph database is something else altogether. Grafting that capability onto a relational database has the potential to perform horribly.

It’s a bad analogy, but SQLite to Postgres is like AMD vs Intel x86 CPUs, whereas a graph database is ARM. Can it be emulated? Yes. Is there a far greater potential for slowdown? Yes.

I think the big difference here is that when comparing SQLite you are at least using the same query language.

In the graph space you have Gremlin, Cypher, GQL and many other proprietary query engines (which also looks to be the case here).

Without that accessibility this feels a bit like pickling a NetworkX object.

Then maybe no such questions are necessary.
It depends on how the graph is stored in the database. In this project the node ids are TEXT so it will likely not scale very well. I know because I have used a similar implementation with GUIDs as strings in SQLite in a project for a couple of years, and while it works fine for the graph I have (<1 million nodes, few edges per node) it won’t perform too well past that.
To some extent I think it depends on what data you're storing in the graph, i.e. if it's temporal data, using a ULID instead of a GUID speeds things up significantly (30x for large data) as your ids are not as fragmented.

https://github.com/schinckel/ulid-postgres/blob/master/ulid....
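
For readers unfamiliar with ULIDs, here is a minimal Python sketch of the idea, assuming the usual ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). `ulid_like` is a hypothetical helper written for this example, not the code behind the link above.

```python
import os
import time

# Crockford base32 alphabet; its characters are in ascending ASCII order,
# so string comparison agrees with numeric comparison of the encoded value.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_like(timestamp_ms=None):
    """Build a 26-character id: 48-bit millisecond timestamp, then 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 31])  # take the lowest 5 bits
        value >>= 5
    return "".join(reversed(chars))


earlier = ulid_like(1_000)
later = ulid_like(2_000)
print(earlier < later)  # ids created later sort after earlier ones
```

Because the timestamp comes first, new ids land near the end of an index instead of at random positions, which is presumably the "not as fragmented" effect described above. The sketch does not measure any speedup itself.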

Thanks for the info. Do you happen to have some stats to share?