| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nmjenkins 4566 days ago

(Author of the spec here)

A fixed-size Merkle tree is a great idea if we wanted to sync the entire set of messages between distributed MX destinations. In fact, we should probably consider that approach for our server <-> server replication. However, JMAP is aimed at client <-> server sync. There are a number of different issues to consider here. The client may have limited or no cache (a webmail app for instance must normally start anew each time, and needs to efficiently fetch just what it needs; if we had to wait for a 60GB mailbox to sync every time we loaded our webmail, well, obviously that's not going to work. A mobile mail client often only wants to download the first few items in the mailbox given the current sort order so it can display them to the client. The JMAP spec allows a client to download, say, just the list of the first 50 messages, newest first, in the Inbox (single round trip). Then, at a later point, it can ask for and receive a space-efficient delta update of what has changed within those first 50 messages in the Inbox, regardless of how big the Inbox really is, or what other messages may have been delivered elsewhere (including lower down the Inbox, but outside the top 50). Again in a single round trip. This is really powerful: you can see that by comparing the refresh time of a folder in the FastMail mobile web app with the iOS or Android native mail app (over IMAP); the mobile web app is much quicker.

The other thing to remember is that the metadata about the message is mutable, and needs to be kept in sync as well. Again, the modseq system described in JMAP allows this to happen efficiently. If you have a distributed server system with multiple masters, you will again use a different system to keep those in sync (as they each assign new modseqs, whereas the client does not). However, this too can be dealt with relatively easily: if the modseq for a message differs between two masters, it must be set on both to max(the higher of the two modseqs, the next highest global modseq on the server with the currently lower modseq). This will ensure that both will end up in the same state, and that the client will get refresh its state for that message if it has to, no matter which master it last synced with.

So essentially, there's a difference in what you want in a protocol for server <-> server synchronisation for distributed mail backends compared to client <-> server sync for mail apps. JMAP is focussed on the latter, but tries to make sure it doesn't include assumptions that severely limit the former. The spec does not require a particular format of message ids: they do not have to be based around IMAP's ascending UIDs (although the sync algorithm for partial mailbox listings may be slightly less efficient if not).

1 comments

bodyfour 4565 days ago

> if we had to wait for a 60GB mailbox to sync every time we loaded our webmail, well, obviously that's not going to work

First, the Merkle tree would just be of the IDs not the message contents. I.e. the problem it is trying to solve is for the heavy-weight client that wants to resync its idea of which message UIDs are contained in a mailbox. Which ones it then requests metadata (or full-contents) is up to it.

It's true that for a ephemeral client (like a webapp) you don't even want to get the whole list of 100K messages in your inbox (although you probably will ask the server for a count of them) However, here having a sequential message number hardly helps either. For example, when I click on "sort by date" in my mail client it shows me the messages by sending date, not by what integer IMAP UID they were given.

The problem of "give me the metadata for the top 100 messages sorted by criteria X" is just something that has to be offered by the server. This is no different than features like searching -- if bandwidth were infinite the client could do it, but it's not so you need to do the work near where the data is stored.

The server will have to maintain data structures to make these kinds of queries efficient. The important point however, is that these are conceptually just caches of information that is in the mailstore. Therefore there is no need to keep that synced between far-flung servers.

> The other thing to remember is that the metadata about the message is mutable, and needs to be kept in sync as well

Yes, message data and metadata are separate problems. If two servers both try to update the metadata for message X at the same instant one needs to win (which one is probably not that important) If two servers both try to add new messages to a folder at the same instant they both need to win, ideally without having to coordinate at all in advance.

For the metadata you can use a modseq, a Lamport timestamp, or probably even just a {wallclock time,server ID} tuple. Assuming good time synchronization is reasonable for a mail cluster. If two clients both try to change message flags within 10ms of each other it probably isn't important who wins as long as somebody definitively does.