For what it's worth, we've managed to make CouchDB about 10x faster (throughput, not latency) in the last year, and we're just getting started on the optimizations.
Probably the biggest one was cutting down on the amount of Erlang message passing that it takes to append a batch of data to the end of the db file.
There was also an Erlang configuration change which allowed us to make use of a thread pool for disk-io so that fsyncs didn't block other activity.
Really it was a bunch of tiny things, that all added up. The catalyst was the creation and use of a few highly-concurrent benchmark suites that we could use to identify bottlenecks.