|
|
|
|
|
by bodyfour
4518 days ago
|
|
Specifically, you want a Merkle tree where the final bucket is based on the lowest N bits of the message arrival time, probably at a granularity of a several seconds. In the typical sync case all of the unseen messages will then be clustered in a few buckets and the Merkle tree will be very efficient. For instance, if each bucket contains 8 seconds worth of arrival time, then 2^17 buckets mean they'll wrap around only every ~12 days. So if you are syncing after a day away there will be only 8% of the buckets potentially dirty (even if you get an incredible flood of mail!) Having the bucket use based on time also allows you to avoid some of the roundtrip costs when doing frequent syncing. In a single request, the client can speculatively send its hashes of buckets that are likely to have changed since its last sync time along with its root hash. The server finds which buckets changed, sending those new bucket contents -- it can then see that the top level hash must now agree on both sides and you're done in one RT. Of course it's possible that other buckets have changed in rare edge cases; in this case the top hashes will not match and you have to do the full Merkle descent to resync. |
|
There are other optimizations you can bring in when you sync the tree:
1. If you are working your way down the remote's tree and you notice that your local signature is zero for the equivalent remote's node signature, then you know that your entire subtree is empty, and you can short-circuit and start downloading entire buckets.
2. Conversely, if you are working your way down the remote's tree and you notice that your local signature is present but the equivalent remote's node signature is zero, then you know that your entire subtree has data, but the remote's entire subtree is empty, and you can short-circuit and start uploading entire buckets.
3. As you work your way down sections of the tree, you can start to build an idea of the average email to bucket ratio. If several buckets are only likely to contain at most one email, then you can short-circuit again.