by sergei 5609 days ago
1. Say I have a 2 node replica set. Now a replica dies, permanently. How is the recovery automated? These are quotes directly from your docs:

http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+R...

"1. Delete all data. If you stop the failed mongod, delete all data, and restart it, it will automatically resynchronize itself. Of course this may be slow if the database is huge or the network slow.

2. Copy data from another member. You can copy all the data files from another member of the set IF you have a snapshot of that member's data file's. This can be done in a number of ways. The simplest is to stop mongod on the source member, copy all its files, and then restart mongod on both nodes. The Mongo fsync and lock feature is another way to achieve this. On a slow network, snapshotting all the datafiles from another (inactive) member to a gziped tarball is a good solution. Also similar strategies work well when using SANs and services such as Amazon Elastic Block Service snapshots."

http://www.mongodb.org/display/DOCS/fsync+Command "Lock, Snapshot and Unlock

The fsync command supports a lock option that allows one to safely snapshot the database's datafiles. While locked, all write operations are blocked, although read operations are still allowed. After snapshotting, use the unlock command to unlock the database and allow locks again."
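For readers following along, the lock, snapshot, unlock sequence quoted above can be sketched roughly like this. It is only a sketch under assumptions: `admin_db` is taken to be any object with a pymongo-style `command(name, **kwargs)` method bound to the admin database, `take_snapshot` is a hypothetical callable standing in for whatever copies the files (tarball, SAN or EBS snapshot), and `fsyncUnlock` is the unlock command name on current servers, which may not match the mechanism the quoted docs had in mind.

```python
from contextlib import contextmanager


@contextmanager
def write_locked(admin_db):
    """Hold the fsync write lock for the duration of the with-block.

    Assumption: admin_db.command(name, **kwargs) sends a database
    command to the admin database, as pymongo's Database.command does.
    """
    admin_db.command("fsync", lock=True)  # flush to disk and block writes
    try:
        yield
    finally:
        # Always release the lock, even if the snapshot step raised.
        admin_db.command("fsyncUnlock")


def snapshot_member(admin_db, take_snapshot):
    """Run the hypothetical take_snapshot() while writes are blocked."""
    with write_locked(admin_db):
        return take_snapshot()
```

The try/finally is the point of the sketch: a snapshot script that dies halfway should not leave the member with writes blocked.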

2. Really? Is this wrong then?

http://www.mongodb.org/display/DOCS/Replica+Set+Design+Conce...

"Writes which are committed at the primary of the set may be visible before the true cluster-wide commit has occurred. Thus we have "READ UNCOMMITTED" read semantics. These more relaxed read semantics make theoretically achievable performance and availability higher (for example we never have an object locked in the server where the locking is dependent on network performance)."


1. You really need a minimum of three replica set nodes, one of which can be a lightweight arbiter. If the primary fails, the secondary node will be promoted to primary automatically. In the case of a network partition, the old primary will come back up as a secondary with no problems. In the case of a true hardware failure, you can resync very quickly from a snapshot. For extra peace of mind, add more nodes to the replica set. You can have up to seven.
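The reason for the three-node minimum can be shown with a little arithmetic. This is a minimal sketch assuming the usual strict-majority election rule (a primary can only be elected when more than half of the voting members can see each other); the function names are made up for illustration and are not part of any MongoDB API.

```python
def votes_needed(voting_members: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable: int) -> bool:
    """True if the reachable members can form a majority on their own."""
    return reachable >= votes_needed(voting_members)


if __name__ == "__main__":
    # Lose one member from a 2-member set and from a 3-member set
    # (the third member may be a data-less arbiter; it still votes).
    for total in (2, 3):
        survivors = total - 1
        print(total, "members,", survivors, "left:",
              can_elect_primary(total, survivors))
```

With two members, the lone survivor holds one vote out of two, which is not a majority, so it cannot become or remain primary. With three voters, the two survivors can still elect one.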

2. If you're reading from both primary and secondary nodes, then the view may not be consistent. In most cases you simply read from the primary for fully consistent reads. You get to decide whether reads from secondaries are consistent or not by setting the write concern (i.e., the minimum number of nodes to replicate to before returning each write).

1. Yes, I recognize that MongoDB will automatically fail over when we go from N nodes in the set to N - 1. But how do I get back to N nodes? That's completely manual.

2. What happens when I read an update that succeeded on the master but then later fails on the slaves?

1. It depends on how the node fails. If there's just a network partition, then you still have N nodes, so no issues. If you're running with durability enabled, and you experience, say, a power outage, then the member should rejoin the set and resync with no issues. If a node's drive crashes, then you'll need to restore from a recent snapshot (within a day or so) or perform a complete resync if you don't have a snapshot. But this can all be done without taking the replica set offline. In that last case, there is some manual work involved. But your post, unless you've corrected it, implies that replica set failover is completely manual. That's certainly not true.

2. Outside of some kind of hardware failure, you won't have situations where writes succeed on the primary but fail on a secondary. And as I stated on your blog post, if you're really concerned about it, you can specify a write concern on insert, and if the write fails to replicate in the desired way, you'll know about it.
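To make the write-concern point concrete, here is a toy in-memory model, not MongoDB's implementation. `Member`, `insert` and `WriteConcernError` are names invented for this sketch, replication is modelled as a synchronous loop, and `w` plays the role of the minimum number of members that must hold the write before the call returns successfully.

```python
class WriteConcernError(Exception):
    """Raised when fewer than `w` members acknowledged the write."""


class Member:
    """Toy replica set member holding documents in a list."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.docs = []

    def apply(self, doc):
        """Store the document; return False if this member is down."""
        if not self.healthy:
            return False
        self.docs.append(doc)
        return True


def insert(doc, primary, secondaries, w=1):
    """Write to the primary, replicate, and require `w` acknowledgements."""
    if not primary.apply(doc):
        raise WriteConcernError("primary unavailable")
    acks = 1  # the primary itself counts as one acknowledgement
    for member in secondaries:
        if member.apply(doc):
            acks += 1
    if acks < w:
        raise WriteConcernError(f"only {acks} of {w} required acknowledgements")
    return acks
```

With one healthy and one failed secondary, `insert(doc, primary, secondaries, w=2)` returns 2, while `w=3` raises, so the client does find out. In this toy model the document is still on the primary after the error, and reads against the primary would see it, which is the case the question about reading an update that later fails on the slaves is getting at.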

Sorry, but "hardware failure" is a fault, and when you can't deal with it, you're not tolerant. And with larger clusters, you see hardware faults on a regular basis. So saying we're ok in the nominal mode is not fault tolerance.