by jrockway 6284 days ago
No, graph databases will rule the world. Couch only supports trees, which is annoying if you want to have relationships between "documents".

Also, Couch's implementation of map/reduce arbitrarily limits the kinds of queries you can run.

If you actually want to ditch your relational database, take a look at things like KiokuDB, Elephant, AllegroCache, and so on. They may not have exciting web 2.0 screencasts, but the technology is much better. (I am biased towards KiokuDB, since I helped write it, but it has been very easy to use KiokuDB instead of a relational database; there has been significantly less code in our apps, tests have been much easier to write, and I don't think we've lost any runtime speed either. I really need to write a long blog post about this, but I haven't had time.)

It is good that the world is gradually working its way up to object/graph databases, though. Last week it was "OMG KEY VALUE STORES SOLVE EVERY PROBLEM", this week it is "DOCUMENT DATABASES WILL RULE THE WORLD", so hopefully the blogosphere is only a few weeks away from enlightenment ;)


Well, just to state the obvious, no particular database will "rule the world".

They are different tools for different tasks.

RDBMS will not go away in data warehousing for a while. Key/value stores are useful for many webapps where they match the access pattern. Graph DBs have their own area of application as well.

There is no "one tool to rule them all", although it would be an interesting project to wrap all those engines under a common API (SQL?) and have the server choose the optimal one, either on user demand or even by magically analyzing the workload.

I'm saying that because the real ugliness many of us are facing is that we need not one but several of the aforementioned tools for our particular app. For parts of the data we like the guarantees and integrity of an RDBMS; for other parts we need the scalability of a key/value store.

Well, just to state the obvious, no particular database will "rule the world".

I don't think so either; I was just parodying the overly editorialized title.

Hierarchies and graphs are both fundamentally navigational data models, which force a physical-logical coupling of concerns, hampering flexibility significantly.

The network model recognizes the flexibility issue with hierarchies; however, I'm not sure how it can guarantee the same scalability benefits. As a result, it has neither the flexibility of the relational model nor the performance characteristics of the hierarchical model.

This is probably why it isn't ruling the world.

This is probably why it isn't ruling the world.

Technical merit and popularity are only occasionally related in the computer world, so I won't comment on this.

I will elaborate further on the use case of graphs versus a relational database. Basically, I see relational databases as especially useful for cases where you have a big pile of data, and have no idea what it really means. You do queries to learn what you have. ("Aha, in March, people from Illinois buy more of product 23894735 than people from California.") The relational model is good for this, since it doesn't build any preconceived notions into your data.

However, most applications don't need anything like this. They have a well-defined data model, and rarely run "queries" (except to work around the fact that that's the only interface to their data). In this case, the relational model is basically being used as a dumb key/value store supplemented with joins. This results in a ton of code in the application to translate in-memory structures to something that can be stored in the database.

If you use a graph database, you can just store your in-memory structures directly, and get them back later. (Most good object databases give you other features, like the ability to index your data so you can still run searches efficiently. Kioku and Elephant do this, anyway.)

So really, you should use the right tool for the job. Want to persist and search in-memory structures? Use an object database. Have a big pile of data you need to make sense of? Use a relational database.

You are allowed to use more than one; they are tools, not religions.

(FWIW, I see relational databases as filling a very specialized role, and I see object databases as the general thing you should use when you want to persist some state. The rest of the world seems to have gotten this backwards, which is somewhat depressing. I think it's because people think databases are magical, as a result of not understanding how they actually work.)

I sort of see where you're coming from in terms of the parity mismatch between RDBMSs and typical application code. Certainly we can avoid some coding headaches by just dumping things into a data store that is optimized for what we want to do with that data right now. But then you say that most applications have a well-defined data model and rarely run "queries". This is where I think you've gone terribly, terribly wrong.

The value of a relational database is that it most agnostically represents the reality of what the data represents. It's not about "big piles of data" or "making sense of your data"; relational databases are about making your data as expressive as possible. You're selling this idea that most applications only use data in a few predefined ways. I have to say that sounds like a complete pipe dream. Requirements change all the time. Reporting needs often are not even conceived until you have hundreds of megabytes of data. Let's not even get into multiple applications using the same database.

In every business I've ever been involved with, the data is always more valuable than the code, and it always outlives the code. Too much of the hype around these alternative database technologies is throwing the baby out with the bathwater. The idea that "most" applications don't need structured data just strikes me as incredibly naive and short-sighted. Far more applications need structured data than need to scale.

Well, let's think about this in more detail.

Let's say you have two types of data, customers and orders. Customers have many orders, an order belongs to a customer. This is easy enough with a typical relational database. You have a customers table and an orders table. You can join the tables and ask questions like "how many customers spent more than $300 last year?"
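To make the relational side concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are all made up for illustration (the comment only names a customers table and an orders table), and "last year" is passed in as a plain year number.

```python
import sqlite3

# Hypothetical schema and data: none of these names come from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        year INTEGER NOT NULL,
        total REAL NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, year, total) VALUES (?, ?, ?)",
    [(1, 2008, 250.0), (1, 2008, 100.0), (2, 2008, 299.0),
     (3, 2007, 500.0), (3, 2008, 20.0)],
)


def count_big_spenders(conn, year, threshold):
    """Count customers whose orders in `year` sum to more than `threshold`."""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT c.id
            FROM customers AS c
            JOIN orders AS o ON o.customer_id = c.id
            WHERE o.year = ?
            GROUP BY c.id
            HAVING SUM(o.total) > ?
        ) AS spenders
        """,
        (year, threshold),
    ).fetchone()
    return row[0]


print(count_big_spenders(conn, 2008, 300))
```

The join, grouping, and filtering are all declared in the query; the database decides how to execute them.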

Now let's consider the graph/object database equivalent. You have two classes, Order and Customer. A Customer has a set of Orders, and the Order has a customer. (Cycles are fine, this is a graph, not a tree.) Creating an order works with some method like $customer->new_order_for('some pants'). You store this in the database, and the graph structure is stored and indexed. (Usually, object databases index on class name, but you can always specify other conditions. This makes it basically equivalent to the relational database.) Note that this structure works very well; the in-memory relationship is the same as the in-storage relationship. You can also write the same query as with the relational database. Get all customers, find their orders from this year, and sum the totals. (Instead of writing SQL, you would just write a script here. You can index things like the order year to speed up the query, as well. Otherwise, it's O(n), but so is the relational database without an index. There is no magic after all.)
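A rough sketch of the object-graph version described above, in plain Python rather than Perl and with no real object database behind it. The extra `total` and `year` arguments to `new_order_for`, the helper names, and the dictionary standing in for a database index are assumptions made for illustration; this is not KiokuDB's API.

```python
from collections import defaultdict


class Customer:
    def __init__(self, name):
        self.name = name
        self.orders = []  # Customer -> Orders

    def new_order_for(self, item, total, year):
        # total and year are hypothetical extras so there is something to query.
        order = Order(self, item, total, year)
        self.orders.append(order)
        return order


class Order:
    def __init__(self, customer, item, total, year):
        self.customer = customer  # Order -> Customer: a cycle, which is fine in a graph
        self.item = item
        self.total = total
        self.year = year


def big_spenders_scan(customers, year, threshold):
    """The 'script' version of the query: an O(n) walk over every order."""
    return [
        c for c in customers
        if sum(o.total for o in c.orders if o.year == year) > threshold
    ]


def index_orders_by_year(customers):
    """A stand-in for a database index on order year."""
    index = defaultdict(list)
    for c in customers:
        for o in c.orders:
            index[o.year].append(o)
    return index


def big_spenders_indexed(index, year, threshold):
    """Same query, but touching only the orders from the requested year."""
    totals = defaultdict(float)
    for o in index.get(year, []):
        totals[o.customer] += o.total
    return [c for c, total in totals.items() if total > threshold]


alice, bob = Customer("alice"), Customer("bob")
alice.new_order_for("some pants", 250.0, 2008)
alice.new_order_for("a shirt", 100.0, 2008)
bob.new_order_for("socks", 299.0, 2008)
bob.new_order_for("a coat", 500.0, 2007)

customers = [alice, bob]
by_year = index_orders_by_year(customers)
print([c.name for c in big_spenders_scan(customers, 2008, 300)])
print([c.name for c in big_spenders_indexed(by_year, 2008, 300)])
```

Both functions answer the same question; the indexed one only shows where an index would cut the work down, as the comment suggests.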

Anyway, there is no lack of flexibility with the graph database. If you want to query your data, you can. It's just less convenient, since you have to write a program to do it, instead of letting your database management engine do it. (This is actually not true in general; AllegroGraph has a querying engine based on Prolog.)

Back when I did data warehousing, we had to move all our data from the web app servers to a warehousing server in a specialized schema so that some GUI software could manipulate the data. Even though we used a relational database, we had to convert anyway. Using an object database would have made the app code simpler, and the warehousing code equally complex. So I think that would be a gain, not a loss.

The flexibility of the relational model is built on its bare simplicity: values + sets + logic. If you propose adding to this and giving up the benefits provided by these simplifications (simplicity of reasoning, ability to change the physical implementation without affecting the logical one [those not-magic indexes], declarative constraints, declarative queries), you need to have a really good reason.

The network model, if it means anything, lets you add pointers to the mix and changes how you reason about your data from sets to graphs. This makes it harder to declare constraints, check integrity, update sets of data, access your data in different ways, and reason about your queries. An imperative script is in no way as safe a program as a declarative query.

The clearest benefit of the network model, in my view, is that it lets you use your object code fairly seamlessly. OK.

The problem is that this is a good trade-off for programs where you don't care primarily about your data, but about your code. I'd argue that isn't true for the majority of programs that have databases at all. The issues that kill your system years down the line and give you nightmares are not code issues, they are data issues. And the code you write to fix those problems... will end up re-implementing all those annoying "heavy" bits of RDBMSs that coders seem to hate.

And the code you write to fix those problems... will end up re-implementing all those annoying "heavy" bits of RDBMSs that coders seem to hate.

Or you won't. I have many important KiokuDB applications in production. And they are not toy Web 2.0 things, they are important sites that perform important data analyses. I don't think any of the logic I had to write to support them was particularly difficult, and it was fewer lines of code than I would need to define my ORM classes.

The flexibility of the relational model is built on its bare simplicity: values + sets + logic.

Simplicity is good. However, software is complex, and complexity has to live somewhere. Look at git, for example. The underlying model is simple and beautiful, blobs, trees, and commits. Wonderful. But, to make that beautiful model into a revision control system, thousands of lines of code had to be written. So while simplicity makes the bottom part of the system simpler, it didn't do much for the overall simplicity of the entire system.

I feel the same way about object databases. They are more complex than a relational database (unless you count things like replication and embedded scripting and ... as relational database features, which I don't), but they help me decrease the complexity of the code I see.

As an example, when I use an RDBMS, I have to maintain:

SQL schema + upgrades

ORM classes

Logic classes (for abstractions over multiple tables or data stores)

Random scripts to query the database and give me munged reports

With an object database, I only have to write the logic classes and the random scripts. The random scripts are slightly more complicated, but not significantly more. The main app is much simpler, and much easier to test. I also don't have to hack my data model into the relational model. (Ever represent a tree in a relational database? It's a hack.)
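For readers who have not had to do it, here is one common way to put a tree into a relational database: an adjacency list (each row points at its parent) queried with a recursive common table expression. This is only a sketch using Python's sqlite3 module; the `nodes` table and its contents are invented for illustration, and other encodings (nested sets, materialized paths) exist.

```python
import sqlite3

# Hypothetical table: each row stores a pointer to its parent row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES nodes(id),  -- NULL marks the root
        name TEXT NOT NULL
    );
    INSERT INTO nodes (id, parent_id, name) VALUES
        (1, NULL, 'root'),
        (2, 1, 'a'),
        (3, 1, 'b'),
        (4, 2, 'a1'),
        (5, 4, 'a1x');
""")


def descendants(conn, node_id):
    """Return (name, depth) for every node below `node_id`."""
    return conn.execute(
        """
        WITH RECURSIVE subtree(id, name, depth) AS (
            SELECT id, name, 0 FROM nodes WHERE id = ?
            UNION ALL
            SELECT n.id, n.name, s.depth + 1
            FROM nodes AS n
            JOIN subtree AS s ON n.parent_id = s.id
        )
        SELECT name, depth FROM subtree
        WHERE depth > 0
        ORDER BY depth, name
        """,
        (node_id,),
    ).fetchall()


print(descendants(conn, 2))
```

With an in-memory object graph the same question is a plain recursive walk over child references, which is the contrast the comment is drawing.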

I've looked at AllegroCache before, and really liked it. Can you briefly discuss how a graph database of type you describe (AllegroCache or KiokuDB) differs from the object-oriented databases which have not exactly taken the world by storm?

Edited: I also just Googled around a bit and found something called Neo4j. Do you know anything about it?

Can you briefly discuss how a graph database of type you describe (AllegroCache or KiokuDB) differs from the object-oriented databases which have not exactly taken the world by storm?

Basically, object databases and graph databases are conceptually the same. I have only used graph databases to store object graphs. Instances are nodes, and "has a" relationships are edges.
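The "instances are nodes, has-a relationships are edges" view can be sketched by walking a small object graph in Python. The classes and the `object_graph` helper are hypothetical, and the rule used to decide what counts as a node (anything with a `__dict__`) is a simplifying assumption, not how any particular object database works.

```python
class Customer:
    def __init__(self, name):
        self.name = name
        self.orders = []


class Order:
    def __init__(self, customer, item):
        self.customer = customer
        self.item = item
        customer.orders.append(self)


def object_graph(root):
    """Collect the instances reachable from `root` (nodes) and the
    attribute references between them (edges)."""
    nodes = {}  # id(obj) -> obj, so cycles are visited only once
    edges = []  # (owner, attribute name, referenced instance)
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in nodes:
            continue
        nodes[id(obj)] = obj
        for attr, value in vars(obj).items():
            children = value if isinstance(value, (list, tuple, set)) else [value]
            for child in children:
                # Simplifying assumption: only objects with a __dict__ are nodes;
                # strings and numbers are treated as plain attribute values.
                if hasattr(child, "__dict__"):
                    edges.append((obj, attr, child))
                    stack.append(child)
    return list(nodes.values()), edges


alice = Customer("alice")
Order(alice, "some pants")
Order(alice, "a shirt")

nodes, edges = object_graph(alice)
print(len(nodes), "nodes,", len(edges), "edges")
```

The customer/order cycle is handled by the visited set, which is why a graph store (unlike a tree-shaped document) has no trouble with it.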

I don't know why this hasn't caught on, as it's made my life significantly easier (by making my software simpler and easier to test). My guess is that there hasn't been a good implementation in a popular language. Perl and Lisp are widespread, but nobody is going to start using either because there is a nice object database for them.

(Neo4j looks nice, and I don't know why it's not more popular. Perhaps it needs a good object-storage frontend, and this is hard to write since Java's MOP is pretty limited. Perl and Lisp both have excellent MOPs, Moose and CLOS, so this makes it easy to write good object databases. With Java, you will have to build your own MOP, which is hard. Same goes for C++ and C#, which are the other "popular" languages.)

I agree, in principle, that the most dominant, important technology will be graph databases. I think so because of my experience working with technologies related to the semantic web.

However, there are reasons one might use a distributed document store to implement a graph store: it's the index that matters. MySQL is not the same thing as InnoDB/Sphinx. Graph DB product X could be built on mapreduce + CouchDB + Lucene (+ some graph index).

The only thing I'm sure of is that if anything stands out above the rest, Oracle and IBM (and maybe MS) will try to buy it.