by lobster_johnson
3740 days ago
Last I checked, RethinkDB had poor aggregation performance, which is key to doing analytics unless you want to use it purely as a master data store and do aggregations elsewhere. Some simple "group by" selects that would take ~3 seconds with Postgres or Elasticsearch would take several minutes with RethinkDB, unless it died of RAM starvation first. It looks to me like RethinkDB is not optimized for sequential read access, nor is its caching algorithm tuned to such workloads. I believe it also lacks many of the aggregations that you'll want to use, like multi-level bucketing on different dimensions. This was ~1 year ago, though, so it may have massively improved since then, who knows.
We're constantly improving performance and a lot has happened within the past year. I think that at this point RethinkDB is as good a database for analytics as many of the other general-purpose databases when it comes to features and performance.
From what I can tell, there are still two main limitations that apply in some, but not all, scenarios:
* Grouping big results without an associated aggregation requires the full result to fit into RAM. I believe this was the limitation that you ran into a year ago, which led to RAM exhaustion. This limitation is still there (https://github.com/rethinkdb/rethinkdb/issues/2719 in our issue tracker). However, we're shipping a new command `fold` with the upcoming 2.3 release of RethinkDB, which can be used in the vast majority of cases to perform streaming grouped operations (in conjunction with a matching index). See https://github.com/rethinkdb/rethinkdb/issues/3736 for details.
* Scanning data sets that don't fit into memory on rotational disks is still inefficient. Most SQL databases employ sophisticated optimizations to structure their disk layout in order to minimize the effects of high seek times. RethinkDB's disk layout is built with a stronger focus on SSDs. This limitation hence doesn't apply if the data is stored on SSDs.
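
To illustrate the idea behind streaming grouped operations from the first point: if rows arrive already ordered by the grouping key (which is what reading through a matching index gives you), each group can be folded and emitted as soon as the key changes, so only one accumulator has to be in memory at a time. The following is a rough sketch in plain Python, not RethinkDB's `fold` API. The function `streaming_group_fold`, its parameters, and the sample rows are all made up for illustration.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple


def streaming_group_fold(
    rows: Iterable[Any],
    key: Callable[[Any], Any],
    base: Any,
    step: Callable[[Any, Any], Any],
) -> Iterator[Tuple[Any, Any]]:
    """Yield (group_key, accumulator) pairs from rows pre-sorted by key.

    Assumption: `rows` is ordered by `key` (e.g. read via an index) and
    `base` is an immutable starting value such as 0.
    """
    no_group = object()  # marker meaning "no group has been started yet"
    current = no_group
    acc = base
    for row in rows:
        k = key(row)
        if current is no_group:
            # First row: start the first group.
            current, acc = k, base
        elif k != current:
            # Key changed: the previous group is complete, emit it.
            yield current, acc
            current, acc = k, base
        acc = step(acc, row)
    if current is not no_group:
        # Emit the final group once the input is exhausted.
        yield current, acc


# Hypothetical sample data, already ordered by "country".
rows = [
    {"country": "DE", "amount": 5},
    {"country": "DE", "amount": 7},
    {"country": "NO", "amount": 1},
    {"country": "US", "amount": 2},
    {"country": "US", "amount": 3},
]

totals = list(
    streaming_group_fold(
        iter(rows),
        key=lambda r: r["country"],
        base=0,
        step=lambda acc, r: acc + r["amount"],
    )
)
```

Memory use here depends on the size of a single accumulator rather than on the size of the full grouped result, which is the property that avoids the RAM exhaustion described above. If the input is not sorted by the grouping key, the same key will be emitted more than once.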