|
That's not quite what I was going for. I enjoyed your article because I agree with your overall point that distributing stuff has an overhead, and also it points to a definite problem in how some research work is portrayed - possibly having to do with what the incentives are in reviewing and publishing. However, I think you're taking your argument to the extreme in a way that doesn't really apply here.
First, from what I understand, graph databases are still far from well understood and don't really represent the best in query optimization. They are in no way representative of RDBMS query optimizers for usual OLAP/OLTP tasks, and not what we're talking about right now. Something like SAP HANA or Redshift, or experimental systems like HyPer and MonetDB, or even something like Impala would represent that literature better here. Or check out MapD, which uses GPUs for parallelism. Or kdb+, which has existed forever, is well known to do a great job parallelizing, and offers a rich query syntax.
Indeed, when I look at the EmptyHeaded paper for example, I see SIMD parallelism, query compilation, join optimization, all stuff that was first developed in the context of RDBMSes. Surprise, surprise, when you apply tried and true techniques in the context of a new problem, you see drastic improvements. This is pretty much exactly the point that Stonebraker and the others above are making: MapReduce was kinda like the graph databases you tested: hyper-focused on one functionality, and missing the memo on decades of other basic optimizations. They're certainly not the only ones guilty of this.
> I don't use databases because they are really quite bad at computation.
Well, if computation's all you need ... I mean, I hope you're kidding here, but there are reasons other than performance that you'd want to have a parallel system, e.g. your working set doesn't fit in memory, or you need to minimize downtime. Granted, these are not common problems.
Also, there are many reasons you'd want a database over a hand-rolled solution: you need to concurrently serve a lot of queries, including insertions and updates, you have to do well on many different types of queries rather than just a single one, etc. Also, /what/ system? Bad at /what/ computation? There are so many different systems for so many different workloads that I can't believe you can seriously make such a statement. If you're saying RDBMSes are bad at graph computations, then sure. That's unsurprising. But that's not what we were talking about! :-/ |
The main contribution of EH (EmptyHeaded) is not the use of SIMD; it is the implementation of new worst-case optimal (WCO) join execution strategies that hadn't been developed in the past 40 years of RDBMSes.
If you wanted that behavior, with its orders-of-magnitude performance improvements, you could not get it from an existing optimizing RDBMS---not HyPer, nor MonetDB, nor anything else in your list---but you could get it from a more programmable data-parallel system.
> Well, if computation's all you need ...
It is a thing I need, which is exactly the point. If the RDBMSes don't provide the performance (or anything within 100x) you can get from a more programmable system, you need a different solution.
Stonebraker's claim was that MR was a huge step backwards, which is BS to the extent that RDBMSes weren't solving the problems Google (and others) had. No amount of fantasy query optimization was going to take SQL to the performance of MR or MPI codes (even circa 2009, Vertica still had no support for UDFs).
You are of course welcome to list other things that RDBMSes are good at, and that's great. However, Stonebraker's claim isn't that RDBMSes have some value (which everyone I know agrees with); his claim is that MR was a shit model and everyone should be using RDBMSes instead (preferably his).
> If you're saying RDBMSes are bad at graph computations, then sure. That's unsurprising. But that's not what we were talking about! :-/
Remind me what that was, then? It seemed like we were talking about whether there was a heavy pro-RDBMS bias in the redbook, which I think is (i) fair, and (ii) fine. I also think Stonebraker is wrong in his claim that MR set things backwards because (as I referenced) RDBMSes weren't there to be set back from. If anything, it prompted a great deal of new work that led to improvements in areas he was blind to. A concrete example of this (e.g. iterative computation) seems totally on-topic.