by hedgehog 3947 days ago
Rough guess on the design: The main metadata history is in a primary log, then each client extends that with its own branch of the history for filesystem operations. Logs are probably some kind of Merkle tree. Bulk data is content addressed and referenced from the logs, maybe with large objects split using a rolling checksum like Adler-32 to reduce object sizes and allow for partial updates to large objects. Someone (you guys?) runs a coordination service that routes notifications between clients and manages leader election. The elected client reads all of the logs, does conflict resolution locally, and then updates the primary log and collects garbage.
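
To make the chunking part of that guess concrete, here is a minimal sketch of content-defined chunking with content-addressed storage. Everything in it is an assumption for illustration: a plain additive rolling sum stands in for Adler-32, the window and mask sizes are arbitrary, the in-memory dict stands in for S3, and the names split_chunks and put_object are made up.

```python
import hashlib

WINDOW = 16          # bytes covered by the rolling sum (illustrative value)
MASK = (1 << 6) - 1  # cut when the low 6 bits of the sum are zero
MIN_CHUNK = 16       # never cut before the window has filled


def split_chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by a rolling sum over recent bytes."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= WINDOW:
            # Drop the byte that just slid out of the window.
            rolling -= data[i - WINDOW]
        length = i - start + 1
        if length >= MIN_CHUNK and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks


def put_object(store: dict[str, bytes], data: bytes) -> list[str]:
    """Store each chunk under its SHA-256 digest and return the key list."""
    keys = []
    for chunk in split_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)  # identical chunks are stored once
        keys.append(key)
    return keys
```

Because boundaries depend on the bytes near them rather than on fixed offsets, an edit in the middle of a large object typically changes only the chunks around it, and the log would only need to reference the new keys.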

One drawback to this design would be that many small files in the filesystem would translate to many small objects in S3 (with associated operations). One solution would be to put small objects right into the metadata log. Alternatively they (or maybe all objects) could be put into a log-structured merge tree.

Another problem with this design is that S3 doesn't support append operations so sync latency would be bounded by client log flush intervals, again creating lots of small objects. Maybe the coordination service routes some of the data to manage this?
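
A rough sketch of what I mean by flush intervals turning into small objects, assuming each flush writes a new, sequentially numbered object because nothing can be appended to an existing one. The SegmentedLog class, the key layout, and the newline framing (records must not contain newlines) are all hypothetical, and a dict stands in for S3.

```python
class SegmentedLog:
    """Append-only log emulated with one immutable object per flush."""

    def __init__(self, store: dict[str, bytes], prefix: str) -> None:
        self.store = store
        self.prefix = prefix
        self.buffer: list[bytes] = []
        self.next_seq = 0

    def append(self, record: bytes) -> None:
        # Records are only buffered locally until the next flush.
        self.buffer.append(record)

    def flush(self):
        """Write buffered records as a new segment; return its key or None."""
        if not self.buffer:
            return None
        key = f"{self.prefix}/{self.next_seq:08d}"
        self.store[key] = b"\n".join(self.buffer)
        self.buffer = []
        self.next_seq += 1
        return key

    def read_all(self) -> list[bytes]:
        """Replay every flushed segment in sequence order."""
        records = []
        for key in sorted(self.store):
            if key.startswith(self.prefix + "/"):
                records.extend(self.store[key].split(b"\n"))
        return records
```

Other clients only see records after a flush, so a shorter interval means lower sync latency but more, smaller objects.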

Anyway, really interesting design problem. Is this close?

1 comments

ObjectiveFS does all the coordination among the clients through S3, so there is no extra coordination service needed. This is why we are really happy that Amazon recently moved the S3 us-east-1 region to read-after-write consistency (like all their other regions).

We do write bundling before sending data to S3 so lots of small files would be packed together and stored in a single S3 object. This also helps reduce the number of object store operations.
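
As a generic illustration of the idea (not the actual ObjectiveFS on-disk format, which is not described here), write bundling can be sketched as packing several small files into one blob with an index of offsets. The layout below, a 4-byte header length followed by a JSON index and the concatenated file data, and the function names are assumptions made for the example.

```python
import json
import struct


def pack_bundle(files: dict[str, bytes]) -> bytes:
    """Pack many small files into a single blob suitable for one PUT."""
    index = {}
    body = bytearray()
    for name, data in files.items():
        index[name] = (len(body), len(data))  # (offset, length) within body
        body += data
    header = json.dumps(index).encode("utf-8")
    # Layout: big-endian header length, JSON index, then the file data.
    return struct.pack(">I", len(header)) + header + bytes(body)


def read_from_bundle(bundle: bytes, name: str) -> bytes:
    """Extract one file from a bundle using the embedded index."""
    (header_len,) = struct.unpack(">I", bundle[:4])
    index = json.loads(bundle[4:4 + header_len])
    offset, length = index[name]
    start = 4 + header_len + offset
    return bundle[start:start + length]
```

With a layout like this, a thousand small files cost one object store write instead of a thousand, at the price of reading the index to locate any single file.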

You are absolutely right that frequent sync will necessarily create many small objects, which is why small objects will be combined into bigger ones (this compaction is done in the background). Sync latency is of course bounded by the S3 PUT time, since fsync(2) can only return after your data has been safely stored in S3.

It is a really interesting design problem. Thanks for sharing your ideas.