

by ddorian43 1591 days ago

> For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+ 100ms, 2 messages). Then, the leader will forward in parallel this write to the 9 nodes of the cluster and wait that a majority answers (+ 100ms, 18 messages). Then the leader will confirm the write to all the cluster and wait for a majority again (+ 100ms, 18 messages). Finally, it will answer to your gateway (already counted in the first step). Our write took 300ms and generated 38 messages over the cluster.

No. The "proxy" node (a random node that you connect to) will do the following:

0. split the file into chunks of ~4MB (can be changed) while streaming

for each chunk (you can write chunks in parallel):

1. get id from master (can be fixed by generating an id in the proxy node with some custom code, 0 messages with custom plugin)

2. write to 1 volume-server (which will write to another node for majority) (2 messages)

3. update metadata layer, to keep track of chunks so you can resume/cancel/clean-failed uploads (metadata may be another raft subsystem, think yugabytedb/cockroachdb, so it needs to do its own 2 writes) (2 messages)

Mark as "ok" in metadata layer and return ok to client. (2 messages)
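As a rough sketch of the flow above (not SeaweedFS code), the snippet below splits a stream into ~4MB chunks and tallies messages using the per-step counts given in the comment. The names `iter_chunks` and `messages_for_upload` are hypothetical. Two things are assumptions: that a master round trip for an id costs 2 messages, and that the final "mark ok" step happens once per file rather than once per chunk.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # ~4MB as in the comment; configurable


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks from a stream without buffering the whole file."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def messages_for_upload(num_chunks: int, local_ids: bool = True) -> int:
    """Tally messages using the per-step counts quoted in the comment."""
    get_id = 0 if local_ids else 2  # assumption: 2 messages for a master round trip
    write_volume = 2                # step 2
    update_metadata = 2             # step 3
    mark_ok = 2                     # final step, assumed to run once per file
    return num_chunks * (get_id + write_volume + update_metadata) + mark_ok


if __name__ == "__main__":
    data = io.BytesIO(b"x" * (10 * 1024 * 1024))  # a 10MB in-memory "file"
    sizes = [len(c) for c in iter_chunks(data)]
    print(sizes)                            # three chunks: 4MB, 4MB, 2MB
    print(messages_for_upload(len(sizes)))  # 3 * (2 + 2) + 2
```

Under these assumptions a single-chunk write costs 6 messages with locally generated ids, against the 38 quoted for the Raft-based design.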

The chunking is more complex and you have to track more data, but in the end it is better. You spread a file across multiple servers & disks. If a server fails under erasure-coding and you need to read a file, you don't have to "erasure-decode" the whole file, only the missing chunks. If you have a hot file, you can spread reads across many machines/disks. You can upload very big files (terabytes), and you can "append" to a file. You can have a smart-client (or colocate a proxy on your client server) for smart-routing and similar.
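To illustrate the point about decoding only the missing chunks, here is a toy model, assuming a simple round-robin placement of chunks onto servers (real placement is up to the system; `place_chunks` and `chunks_needing_decode` are hypothetical names). It only computes which chunks are affected by a failed server and does no actual erasure coding.

```python
from typing import Dict, List


def place_chunks(num_chunks: int, servers: List[str]) -> Dict[int, str]:
    """Assign chunk index -> server, round-robin (an illustrative assumption)."""
    return {i: servers[i % len(servers)] for i in range(num_chunks)}


def chunks_needing_decode(placement: Dict[int, str], failed: str) -> List[int]:
    """Only chunks that lived on the failed server need to be reconstructed."""
    return sorted(i for i, server in placement.items() if server == failed)


if __name__ == "__main__":
    placement = place_chunks(8, ["s1", "s2", "s3", "s4"])
    missing = chunks_needing_decode(placement, "s2")
    print(missing)  # chunks 1 and 5; the other six are read directly
```

In this model a read of the 8-chunk file after losing one of four servers reconstructs 2 chunks and reads the rest as usual.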


ricardobeat 1591 days ago

If you're still talking about SeaweedFS, the answer seems to be that it's not a "raft-based object store", hence it's not as chatty as the parent comment described.

That proxy node is a volume server itself, and uses simple replication to mirror its volume on another server. Raft consensus is not used for the writes. Upon replication failure, the data becomes read-only [1], thus giving up partition tolerance. These are not really comparable.

[1] https://github.com/chrislusf/seaweedfs/wiki/Replication

vlovich123 1591 days ago

How does step 1 work? My understanding is that the ID from the master tells you which volume server to write to. If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?

ddorian43 1590 days ago

> If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?

You just need a unique id (which you generate locally). And you need a writable volume-id, which you can get by querying the master or a master-follower, by caching it, or by querying a volume-server directly.
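The "cache it" option could look roughly like this. It is a minimal sketch, not a SeaweedFS API: `WritableVolumeCache` and the `fetch` callable are hypothetical, and `fetch` stands in for whatever query (master, master-follower, or volume-server) returns the currently writable volume ids.

```python
import random
import time
from typing import Callable, List


class WritableVolumeCache:
    """Cache the list of writable volume ids and refresh it after a TTL."""

    def __init__(self, fetch: Callable[[], List[int]], ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical lookup against master/follower
        self._ttl = ttl_seconds
        self._clock = clock
        self._volumes: List[int] = []
        self._expires_at = 0.0

    def pick(self) -> int:
        """Return a random writable volume id, refreshing the cache if stale."""
        now = self._clock()
        if not self._volumes or now >= self._expires_at:
            self._volumes = list(self._fetch())
            self._expires_at = now + self._ttl
        if not self._volumes:
            raise RuntimeError("no writable volumes available")
        return random.choice(self._volumes)
```

With a cache like this, most chunk writes need no message to the master at all; the trade-off is that a cached volume may have become read-only, in which case the write fails and the cache should be refreshed.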

chrislusf 1591 days ago

In snowflake id generation mode, the "which volume is writable" information can be read from other follower masters.
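For readers unfamiliar with snowflake ids, here is a minimal sketch of the general technique, using the classic Twitter layout (41-bit millisecond timestamp, 10-bit node id, 12-bit sequence). The layout and the epoch are assumptions for illustration; SeaweedFS's own bit layout may differ, and `SnowflakeGenerator` is a hypothetical name.

```python
import threading
import time
from typing import Callable, Optional


class SnowflakeGenerator:
    EPOCH_MS = 1288834974657  # Twitter's epoch, used here only as an example
    NODE_BITS = 10
    SEQ_BITS = 12

    def __init__(self, node_id: int,
                 clock: Optional[Callable[[], int]] = None) -> None:
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id out of range")
        self._node_id = node_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        """Return an id unique per node, with no coordination between nodes."""
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << self.SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            shift = self.NODE_BITS + self.SEQ_BITS
            return (((now - self.EPOCH_MS) << shift)
                    | (self._node_id << self.SEQ_BITS)
                    | self._seq)
```

Because uniqueness comes from the node id plus a local counter, the id itself costs no messages; only the "which volume is writable" lookup still touches a master or follower.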
