Hacker News new | ask | show | jobs
by norkakn 4195 days ago
Clustering is really, really hard.

Cassandra is one of the better ones out there, but you have to deal with its data model and weird consistency promises (which however weird you think they are, are weirder)

The correct way to cluster also changes dramatically depending on your use case. Sure there are things like RAC that promise to make it just work, but those don't scale more than a few nodes.

Mongo is kind of the worst in this - it clusters in one weird way, has bad tooling, and subtly destroys your data at scale.

The general philosophy with postgres is to do it right, or not do it. There are ways to do specific kinds of clustering, but all of them (just like mongo, oracle, etc) have a lot of nuances to them.

If you have a natural shard key, use a bunch of schemas and table inheritance, and eat the downtime during re-shards. Check out citus as well. They have their issues, but they can help you hook up what you need.

3 comments

> If you have a natural shard key, use a bunch of schemas and table inheritance,

They are working on improving FDW API to help make foreign tables available as inheritance children. I think It's a step forward...

This plus zookeeper to manage shard transitions would be awesome.
> Cassandra is one of the better ones out there

I take it you haven't actually used Cassandra much in the last few years. It's data model is almost identical to a typical relational one and it's consistency promises are quite clear:

http://www.datastax.com/documentation/cql/3.0/cql/aboutCQL.h...

And I've scaled Cassandra clusters from 1 to 100 nodes in hours with no issues. It really is quite simple. Likewise have had no issues with MongoDB replica sets. It is definitely not "really, really hard".

> postgres is to do it right, or not do it

What a pathetic cop out. PostgreSQL has been around for decades they've had plenty of time to have a proven, stable solution implemented.

Cassandra avoids some of the really visible issues by being AP instead of CP. Hbase hits them, but dodges a bit by only having row level consistency. They are solving very different problems.

The lack of vector clocks in Cassandra can lead to some very non-intuitive (possible wrong) behavior - check out their counter implementation for some rage on that. It's pretty well made though, and I think C*, Hbase and Postgres all have great uses (along with Redis, and a lot of others)

Mongo tends to get things subtly wrong in ways that corrupt data, or that don't scale, and it gives up both A and C.

It's quite simple until you write code that has to deal with AP (as in CAP), and the limitations that come with it. I have, and what a pain that was.
> Mongo is kind of the worst in this - it clusters in one weird way, has bad tooling, and subtly destroys your data at scale.

I'm genuinely interested. How does Mongo corrupts data? Thanks.

The most infuriating way is that it will roll back its oplog on an election, and sometimes throw away confirmed writes.

Mongos does weird magic as well when it gets confused, and will confirm writes to the wrong shards during "interesting" situations.

This is more than worrysome. I googled for it and found this http://docs.mongodb.org/v2.6/core/replica-set-rollbacks/

Is this behaviour what you are referring to?

> A rollback reverts write operations on a former primary when the member rejoins its replica set after a failover. A rollback is necessary only if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down. When the primary rejoins the set as a secondary, it reverts, or “rolls back,” its write operations to maintain database consistency with the other members.

Yeah, it's worse than it sounds though. If you write to 2/3 nodes, and the third gets elected, you can roll back majority confirmed writes. I think it did even worse things when near capacity, but those are harder to pin down.
So the "majority" write concern is not safe. However it seems that the number of nodes that must confirm a write can be configured http://docs.mongodb.org/manual/core/replica-set-write-concer... From what I read there it seems that if you want to get a confirmation from all the nodes and one gets offline you'll be waiting forever (but there is wtimeout). I'm afraid that any db cluster with any technology has similar problems.