|
|
|
|
|
by _urga
4518 days ago
|
|
For anti-entropy in a distributed system, one could use a combination of the following: 1. Get everything created since last sync. A created index (as opposed to updated index) would not need to be resorted for subsequent updates. This would get only new mail. 2. A fixed-size Merkle tree, 8 blocks per level, 7 levels deep, with around 262144 blocks at the base. This would be the best tree configuration for syncing from a few thousand up to 40 million emails. This would identify all mail updated since last connection, within 4 roundtrips with a minimum of bandwidth. The tree should be incrementally updated using XOR at both client and server whenever an update is made. It's possible to use a tree with less depth for fewer roundtrips but with support for less emails or requiring more bandwidth. 3. Push sync. Server would push all creates/updates to client while client is connected. |
|
For instance, if each bucket contains 8 seconds worth of arrival time, then 2^17 buckets mean they'll wrap around only every ~12 days. So if you are syncing after a day away there will be only 8% of the buckets potentially dirty (even if you get an incredible flood of mail!)
Having the bucket use based on time also allows you to avoid some of the roundtrip costs when doing frequent syncing. In a single request, the client can speculatively send its hashes of buckets that are likely to have changed since its last sync time along with its root hash. The server finds which buckets changed, sending those new bucket contents -- it can then see that the top level hash must now agree on both sides and you're done in one RT.
Of course it's possible that other buckets have changed in rare edge cases; in this case the top hashes will not match and you have to do the full Merkle descent to resync.