by tony_landis 4614 days ago
I think it was Rich Hickey that said the document database is the worst of them all, because you are now married to that structure.

Having used CouchDB in production for two years, I have to agree with his analysis, and offer my own opinion that CouchDB is highly overrated. Not because it is a bad implementation of a document-style database, but because the document store itself is not a good match for most use cases.

If the only requirement is a replicated JSON document store, it may work OK for you. But so would Riak, Postgres and some others.

If you need to update the data in those documents or ever need to query the data in ways you did not initially envision, you will quickly find yourself missing features which even traditional SQL databases are very good at. Development is slower.

Writing map/reduce for queries is cumbersome, particularly if you prefer not to use JavaScript. And you have to paste them into a textarea in a web interface, or manually put them into CouchDB over HTTP using curl or some library that abstracts this away. Either way it is a degree of separation that makes the data feel more out of reach than it does through a console interface like psql or mysql.
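For anyone who has not seen what that looks like: a view lives inside a design document, and the map function is stored as a string of JavaScript. Below is a minimal sketch of building one in Python. The database name `mydb`, the design document name `people`, the view name `by_age`, and the `age` and `name` fields are all made up for illustration. The resulting body would be sent with an HTTP PUT to `/mydb/_design/people`.

```python
import json

# JavaScript map function, stored in CouchDB as a plain string.
# The "age" and "name" fields are hypothetical document attributes.
MAP_BY_AGE = """
function (doc) {
  if (doc.age !== undefined) {
    emit(doc.age, doc.name);
  }
}
""".strip()


def build_design_doc(view_name, map_source, reduce_source=None):
    """Return the JSON-serialisable body of a design document with one view."""
    view = {"map": map_source}
    if reduce_source is not None:
        # "_count", "_sum" and "_stats" are built-in reduce functions.
        view["reduce"] = reduce_source
    return {"language": "javascript", "views": {view_name: view}}


design_doc = build_design_doc("by_age", MAP_BY_AGE, reduce_source="_count")
body = json.dumps(design_doc, indent=2)
print(body)
```

Even from a non-JavaScript app, the query logic itself still ends up as JavaScript source inside a JSON string, which is the separation described above.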

Consider the scenario where you want to update the value in an attribute on several thousand, or even just several documents that match some criteria. In SQL, you would simply jump in the console and in a few seconds or minutes complete that as a transaction with something like:

> update table set col=val where criteria.

There is no such feature in CouchDB. You will need to write code to filter and fetch each matching document, manipulate it as needed, then write the entire thing back. All that to update a few bits, which hopefully are not nested too deep, because nesting really increases the complexity of the code you need to write.
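To make the fetch, modify, write-back cycle concrete, here is a rough sketch in Python. It only builds the request body for CouchDB's `_bulk_docs` endpoint and leaves the HTTP calls out. The helper names, the sample documents, and the `plan.tier` field are hypothetical.

```python
import copy


def set_path(doc, path, value):
    """Set a possibly nested attribute, e.g. path=("address", "city")."""
    node = doc
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


def build_bulk_update(docs, matches, path, value):
    """Return a _bulk_docs request body that rewrites every matching doc.

    Each document is sent back whole, including its _id and _rev,
    because there is no partial, in-place update.
    """
    changed = []
    for doc in docs:
        if matches(doc):
            new_doc = copy.deepcopy(doc)  # leave the fetched copy untouched
            set_path(new_doc, path, value)
            changed.append(new_doc)
    return {"docs": changed}


# Hypothetical documents, as they might come back from
# GET /mydb/_all_docs?include_docs=true
fetched = [
    {"_id": "a", "_rev": "1-abc", "type": "user", "plan": {"tier": "free"}},
    {"_id": "b", "_rev": "3-def", "type": "user", "plan": {"tier": "pro"}},
    {"_id": "c", "_rev": "1-ghi", "type": "invoice"},
]

payload = build_bulk_update(
    fetched,
    matches=lambda d: d.get("type") == "user" and d["plan"]["tier"] == "free",
    path=("plan", "tier"),
    value="trial",
)
print(payload)
```

The payload would then be POSTed to `/mydb/_bulk_docs`. Note that this is not a transaction in the SQL sense: individual documents can still be rejected with a conflict if they changed after they were fetched.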

As memracom stated, the replication is not perfect. In my experience, even on a low-latency network, the only safe way to ensure a client can immediately read back what it just wrote is to route it through the likes of haproxy with a sticky session. Otherwise you have a good probability of getting a 404 after a POST (create) or stale data after a PUT (update).

So for what it is worth, here is my advice on choosing a database from an ease of development standpoint:

1) has as many features as you can get, even if you don't need them initially

2) has top-notch libraries for your language / framework

3) has relation awareness - do not denormalize unless you must

4) supports consistency

5) supports in-place updates - easily filter and change values (doesn't apply to Datomic)

6) has tools that make schema changes and reshaping data easy, and lets you do both online

Maybe 2 years ago CouchDB was a great solution. But with memory and SSD storage being so cheap and so much innovation in traditional and NoSQL DBs, I don't foresee myself deploying CouchDB again. If I did need a place to dump some semi-structured data, I find Amazon's hosted offerings more attractive.

3 comments

I mostly agree. CouchDB's data and query model only works well for a subset of use cases, and anything past that subset makes life hard (where the line sits is vague and only really understandable after being burnt). I do think that the query model and general capabilities can (and will) be vastly improved, but it's already taken too long.

However, I don't agree that many (or any) other things are suitable alternatives for replicable JSON stores, where replication means peer-to-peer stores that can operate offline for any period of time.

My particular interest these days is building web applications that work well offline; it's why I build PouchDB. While it's entirely possible to build a homemade sync solution on top of your favourite database, it's an extremely hard problem, and something I see app after app try and fail at constantly.

If I didn't need the ability to sync data that works offline, I wouldn't use CouchDB (PouchDB/Cloudant etc.), but since that is what I am interested in, right now I think it's pretty much the only choice.

I think the kind of thing you are doing is the sweet spot for couch. Perhaps over time it will be less so, but for now I think it is pretty unique.

I agree with some but not all of what you write.

0) CouchDB will never lose your data. Period. Not many other stores are 'append only, copy on write'. If your data is transient, you may not care about that, but many apps expect the DB to never lose or corrupt data. Take it down with 'kill -9'? No problem, it's guaranteed to be consistent on disk.

1) I think document DBs are as good as or better than a key-value store like Riak. It's great to have the choice, at a later point in time, to reach inside your documents, build indexes, etc.

2) The biggest wart with CouchDB from a scaling standpoint is that it is limited to single-server, master-slave, and master-master setups. There is no Dynamo-style clustering, à la Cassandra, Riak, etc. We added that in our own stack in '09 and it has finally hit the Apache CouchDB repo in a refined state; you'll see it in Apache CouchDB 2.0.

3) Finally, the biggest wart from a usability standpoint is the need to build materialized views. Ad hoc queries are painful. With Apache CouchDB, most folks use Elastic Search alongside it. In Cloudant we embedded Lucene into each cluster node so you can do the obvious things: 'GET http://...?q=name:"Mik*" AND age:[25 TO 34] & sort...'

Good points. Replication is certainly not painless to set up, and I've had trouble with continuous mode simply failing. I'm sure those warts will be worked out though.

Lucene and Elastic Search go a long way, but it is one more service to configure and maintain. That's been an annoyance for me. If Lucene could be built right into CouchDB, that would be a major improvement.

However, that still doesn't let me cherry pick the values I want from a document. When the app is in a dynamic language, the cost of deserialization can add up.

Some kind of XPath-style expression to pull out specific parts of the doc would free developers from spending as much time writing views, and having that operation take place server-side would likely be much more performant. Maybe that should be an Elastic Search feature though and not Couch.

When criticising CouchDB, don't forget about its killer features:

Replication: slave, master, multi-master, pull, push, single, continuous over http(s), you name it.

Update handlers: You don't have to fetch, modify and save in every case.

MVCC semantics: Lock-free write access. Never, ever database deadlocks.
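As a rough illustration of the MVCC point, here is a toy in-memory model in Python. It is not CouchDB code and does not use CouchDB's real revision format; it only shows the idea that a write carrying a stale revision is rejected (CouchDB answers with a 409 conflict) rather than waiting on a lock. The class and method names are invented for the example.

```python
class Conflict(Exception):
    """Raised when a write carries a stale revision (think HTTP 409)."""


class ToyStore:
    """In-memory stand-in for one database; revisions are plain integers."""

    def __init__(self):
        self._docs = {}  # doc id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            # Someone else wrote first: reject instead of blocking on a lock.
            raise Conflict(doc_id)
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = ToyStore()
store.put("doc1", {"n": 0})              # create, no revision needed

rev_a, body_a = store.get("doc1")        # writer A reads revision 1
rev_b, body_b = store.get("doc1")        # writer B reads revision 1 too

store.put("doc1", {"n": 1}, rev=rev_a)   # A wins and creates revision 2

try:
    store.put("doc1", {"n": 2}, rev=rev_b)   # B's revision is now stale
    outcome = "written"
except Conflict:
    outcome = "conflict"

print(outcome)
```

Nobody ever waits on anybody else, so there is nothing to deadlock on. The price is that the losing writer has to re-read the document and try again.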

Here is the reference for update handlers for anyone wanting to check it out:

http://docs.couchdb.org/en/latest/ddocs.html#update-function...

It still requires writing code, and moving it into the database.

Once that is done, how do you call that function against an arbitrary list of documents and pass the new values to it without writing even more code somewhere?
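As far as I can tell, the answer is that you still need a client-side loop. A hedged sketch in Python: it only builds the list of requests, one per document, using the `/{db}/_design/{ddoc}/_update/{func}/{docid}` URL layout for update handlers. The server address, database, design document, handler name `set_field`, and the `field`/`value` query parameters are all hypothetical, and the handler would have to be written to read them.

```python
from urllib.parse import quote, urlencode


def update_handler_requests(base_url, db, ddoc, func, doc_ids, params):
    """Yield (method, url) pairs, one update-handler call per document.

    The handler itself (JavaScript, stored in the design document) would
    read the new values from the request's query string.
    """
    query = urlencode(params)
    for doc_id in doc_ids:
        url = "{}/{}/_design/{}/_update/{}/{}?{}".format(
            base_url.rstrip("/"),
            quote(db, safe=""),
            quote(ddoc, safe=""),
            quote(func, safe=""),
            quote(doc_id, safe=""),
            query,
        )
        yield "PUT", url


requests_to_send = list(
    update_handler_requests(
        "http://localhost:5984",   # hypothetical server
        "mydb",                    # hypothetical database
        "app",                     # hypothetical design document
        "set_field",               # hypothetical update function
        ["user:1", "user:2"],
        {"field": "status", "value": "archived"},
    )
)
for method, url in requests_to_send:
    print(method, url)
```

So the handler saves the fetch and the full write-back per document, but finding the matching ids and issuing one request per id is still code living somewhere outside the database.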

The problem with this workflow of putting code/logic in the db is that it forces developers out of their preferred development environment, workflow, and most likely language.

Not to mention the fact that debugging all these couchdb functions and map/reduce calls becomes a nightmare. And testing - not sure how that could be done efficiently.

All of this slows development.

It is possible to implement some web apps completely in static html, js, and couchdb, eliminating the need for anything server side. In those cases, couchdb is one of a kind.

Back in the 90s, Sybase Replication Server was pretty sweet, and we ran topologies you wouldn't believe...