Hacker News new | ask | show | jobs
by alnayyir 5393 days ago
CouchDB is better'ish for larger datasets, but not for arbitrary scaling. MapReduce in CouchDB requires dumb full-scans if you're not just refreshing an existing view.

Arbitrarily large data is the exclusive domain of hadoop/hypertable/cassandra AFAIK atm.

2 comments

To be fair CouchDB is very explicit that to get any sort of performance, everything must be a view. "Ad-hoc queries" (i.e. queries that are written on the fly instead of uploaded as a view) are clearly stated as "for development only".

Where CouchDB really falls flat is for write-heavy applications. The default configuration in CouchDB is to not reindex a view until it has been read. When a read occurs, any new data in a view that was added since the last read must be re-indexed by executing the map/reduce functions on that data. If you're writing frequently to CouchDB but not reading a lot (as in a data warehouse) the first query you run is going to be extremely slow, since it will need to run map/reduce on a lot of new data. CouchDB doesn't distribute work to multiple nodes like Hadoop, and I've found even simple reduce functions to slow down re-indexing by a factor of 10. I think CouchDB has settings now to update the index on commit, or you could always run a cron job to regularly query the view and force a reindex, but it's still going to be slow.

BigCouch (https://cloudant.com/#!/solutions/bigcouch) might be a potential choice for data warehousing, since it advertises full compatibility with the CouchDB API but offers distributed map/reduce like Hadoop/Hive/etc. I haven't used it though.

Couch is definitely a lot more honest about their limitations than mongo or riak, but my experiences make me hesitant to recommend it to anyone not intimately familiar with those limitations.
This is part of what we are addressing with Couchbase Server, an autosharding rebalancing Couch fronted by memcached. For K/V read and write we measure microsecond latency.

We are currently optimizing the views for cluster access, but the design goal is to offer at least the query performance CouchDB offers on small datasets, even on very large clusters.

More info: http://blog.couchbase.com/couchbase-server-2-0-tour-and-demo

Is it going to have the same/similar multi-replication for enterprise only limitation that Riak has or is this going to be funded in some other way?