by alnayyir
5393 days ago
CouchDB is better-ish for larger datasets, but not for arbitrary scaling. MapReduce in CouchDB requires a dumb full scan of every document unless you're just refreshing an existing view. Arbitrarily large data is the exclusive domain of Hadoop/Hypertable/Cassandra, AFAIK, at the moment.
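To make the full-scan versus refresh distinction concrete, here's a toy model in Python. It is not CouchDB's actual implementation; ToyView, by_type, and the sample documents are all made up for illustration, and it assumes a map-only view fed by an ordered list of (sequence number, document) changes.

```python
class ToyView:
    """Toy model of a lazily updated, map-only view (not CouchDB's real code)."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = {}        # doc id -> rows emitted by the map function
        self.indexed_seq = 0  # highest update sequence already indexed

    def refresh(self, changes):
        """Map only changes newer than indexed_seq; return how many docs were mapped."""
        mapped = 0
        for seq, doc in changes:
            if seq <= self.indexed_seq:
                continue  # already indexed on an earlier refresh
            self.rows[doc["_id"]] = list(self.map_fn(doc))
            self.indexed_seq = seq
            mapped += 1
        return mapped


def by_type(doc):
    # Hypothetical map function: emit one row keyed by document type.
    yield doc["type"], 1


# 1000 existing documents, in update-sequence order.
changes = [(seq, {"_id": f"doc{seq}", "type": "event"}) for seq in range(1, 1001)]

view = ToyView(by_type)
first_build = view.refresh(changes)   # brand-new view: every document gets mapped

changes.append((1001, {"_id": "doc1001", "type": "event"}))
incremental = view.refresh(changes)   # existing view: only the new write gets mapped
```

In this sketch the first refresh maps all 1000 documents, while the second maps only the one new write, which is the gap between defining a new view and refreshing an existing one.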
Where CouchDB really falls flat is write-heavy applications. By default, CouchDB doesn't reindex a view until the view is read. When a read occurs, any data added since the last read must be indexed by running the map/reduce functions over it. If you're writing frequently to CouchDB but not reading a lot (as in a data warehouse), the first query you run is going to be extremely slow, since it has to run map/reduce over a lot of new data. CouchDB doesn't distribute that work across multiple nodes the way Hadoop does, and I've found that even simple reduce functions slow down reindexing by a factor of 10. I think CouchDB has settings now to update the index on commit, or you could always run a cron job that regularly queries the view to force a reindex, but it's still going to be slow.
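For the cron-job idea, a minimal sketch in Python might look like the following. It assumes the standard CouchDB view endpoint (/{db}/_design/{ddoc}/_view/{view}) and that a query with limit=0 still brings the index up to date before answering; the helper names view_url and warm_view, and the database, design document, and view names in the note below, are made up for illustration.

```python
import urllib.parse
import urllib.request


def view_url(base_url, db, design_doc, view_name):
    """Build the URL for a view query that returns no rows (limit=0)."""
    path = "/".join(
        urllib.parse.quote(part, safe="")
        for part in (db, "_design", design_doc, "_view", view_name)
    )
    return f"{base_url.rstrip('/')}/{path}?limit=0"


def warm_view(base_url, db, design_doc, view_name, timeout=600):
    """Query the view so pending writes get indexed; return the HTTP status.

    The long default timeout is an assumption: the request may block for
    as long as the reindex takes.
    """
    url = view_url(base_url, db, design_doc, view_name)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # drain the (tiny) response body
        return resp.status
```

A cron entry could then call something like warm_view("http://localhost:5984", "warehouse", "reports", "by_day") every few minutes. This only moves the indexing cost off the first real reader; it doesn't make the map/reduce pass itself any faster.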
BigCouch (https://cloudant.com/#!/solutions/bigcouch) might be an option for data warehousing, since it advertises full compatibility with the CouchDB API but offers distributed map/reduce like Hadoop/Hive/etc. I haven't used it, though.