|
So let's take the example of a 9-node cluster with a 100ms RTT over the network to see why. In this specific (yet somewhat artificial) situation, Garage particularly shines compared to Minio or SeaweedFS (or any Raft-based object store) while providing the same consistency properties.
For a Raft-based object store, your gateway receives the write request and forwards it to the leader (+ 100ms, 2 messages). The leader then forwards this write in parallel to the 9 nodes of the cluster and waits for a majority to answer (+ 100ms, 18 messages). Next, the leader confirms the write to the whole cluster and waits for a majority again (+ 100ms, 18 messages). Finally, it answers your gateway (already counted in the first step). In the end, our write took 300ms and generated 38 messages over the cluster.
Another critical point with Raft is that your writes do not scale: they all have to go through the leader. So from the point of view of writes, it is not very different from having a single server.
For a DynamoDB-like object store (Riak CS, Pithos, OpenStack Swift, Garage), the gateway receives the request and knows directly which nodes must store the write. For Garage, we chose to store every write on 3 different nodes. So the gateway sends the write request to the 3 nodes and waits for at least 2 of them to confirm the write (+ 100ms, 6 messages). In the end, our write took 100ms and generated 6 messages over the cluster, and this cost does not depend on the number of nodes in the cluster.
With this model, we can still always return up-to-date values. When performing a read request, we also query the 3 nodes that must contain the data and wait for 2 of them. Because we have 3 nodes, wrote to at least 2 of them, and read from 2 of them, the write set and the read set share at least one node, so we necessarily get the last value. This algorithm is discussed in Amazon's Dynamo paper[0].
I reasoned in a model where there is no bandwidth limit, no CPU limit, and no contention at all.
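Under that idealized model, the arithmetic above can be checked with a small Python sketch. It is only an illustration of the counting in this comment, not code from Garage or from any Raft implementation: the function names are made up, and the model assumes every hop costs one full RTT and that parallel requests add no latency.

```python
from itertools import combinations

RTT_MS = 100  # assumed round-trip time between any two nodes


def raft_write_cost(cluster_size: int, rtt_ms: int = RTT_MS) -> tuple[int, int]:
    """Return (latency in ms, message count) for one write in the Raft model above."""
    latency = 0
    messages = 0
    # Gateway -> leader, plus the final answer back: one round trip, 2 messages.
    latency += rtt_ms
    messages += 2
    # Leader sends the write to every node and waits for a majority.
    latency += rtt_ms
    messages += 2 * cluster_size
    # Leader confirms the write to every node and waits for a majority again.
    latency += rtt_ms
    messages += 2 * cluster_size
    return latency, messages


def quorum_write_cost(replicas: int = 3, rtt_ms: int = RTT_MS) -> tuple[int, int]:
    """Return (latency in ms, message count) when the gateway contacts replicas directly."""
    # All replicas are contacted in parallel: one round trip, 2 messages per replica.
    return rtt_ms, 2 * replicas


def quorums_always_overlap(replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """Check by brute force that every write set shares a node with every read set."""
    nodes = range(replicas)
    return all(
        set(w) & set(r)
        for w in combinations(nodes, write_quorum)
        for r in combinations(nodes, read_quorum)
    )


print(raft_write_cost(9))               # (300, 38)
print(quorum_write_cost(3))             # (100, 6)
print(quorums_always_overlap(3, 2, 2))  # True
print(quorums_always_overlap(3, 1, 2))  # False
```

The last two lines show why waiting for 2 out of 3 nodes on both writes and reads matters: with a write quorum of 1, a read of 2 nodes can miss the only node that holds the latest value.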
In real systems, these limits apply, and we think that's another argument in favor of Garage :-)
[0]: https://dl.acm.org/doi/abs/10.1145/1323293.1294281 |
No. The "proxy" node, a random node that you connect to, will do the following:
0. split the file into chunks of ~4MB (can be changed) while streaming
for each chunk (you can write chunks in parallel):
1. get an id from the master (can be avoided by generating the id in the proxy node with some custom code; 0 messages with a custom plugin)
2. write to 1 volume-server (which will write to another node for majority) (2 messages)
3. update the metadata layer to keep track of chunks, so you can resume/cancel/clean up failed uploads (the metadata layer may be another Raft subsystem, think yugabytedb/cockroachdb, so it needs to do its own 2 writes) (2 messages)
Mark as "ok" in metadata layer and return ok to client. (2 messages)
Chunking is more complex and you have to track more data, but in the end it is better. You spread a file across multiple servers & disks. If a server fails with erasure coding and you need to read a file, you don't have to "erasure-decode" the whole file, only the missing chunks. If you have a hot file, you can spread reads over many machines/disks. You can upload very big files (terabytes), and you can "append" to a file. You can have a smart client (or colocate a proxy on your client server) for smart routing and the like.
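As a rough illustration of the per-chunk flow listed above, here is a toy in-memory sketch in Python. It is not SeaweedFS code: `ToyProxy`, `iter_chunks` and the dict-based "volume servers" and "metadata layer" are hypothetical stand-ins, replication to a second node is not simulated, and the message counts simply reuse the numbers given in the steps (with the id generated locally, i.e. the 0-message variant of step 1).

```python
import io
import uuid
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # ~4MB as in step 0; configurable


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks while streaming, without loading the whole file."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


class ToyProxy:
    """Hypothetical stand-in for a proxy node, its volume servers and metadata layer."""

    def __init__(self, num_volume_servers: int = 3) -> None:
        self.volumes = [{} for _ in range(num_volume_servers)]  # chunk id -> bytes
        self.metadata = {}  # file name -> {"status": ..., "chunks": [...]}
        self.messages = 0   # message count, using the numbers from the steps above

    def upload(self, name: str, stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> None:
        entry = {"status": "pending", "chunks": []}
        self.metadata[name] = entry
        for index, chunk in enumerate(iter_chunks(stream, chunk_size)):
            # Step 1: generate the id in the proxy itself (0 messages).
            chunk_id = uuid.uuid4().hex
            # Step 2: write to one volume server (2 messages); round-robin placement
            # is an assumption made for this sketch.
            server = index % len(self.volumes)
            self.volumes[server][chunk_id] = chunk
            self.messages += 2
            # Step 3: record the chunk in the metadata layer (2 messages).
            entry["chunks"].append((chunk_id, server))
            self.messages += 2
        # Final step: mark the file as "ok" in the metadata layer (2 messages).
        entry["status"] = "ok"
        self.messages += 2

    def download(self, name: str) -> bytes:
        """Reassemble a file from its chunks, in upload order."""
        entry = self.metadata[name]
        return b"".join(self.volumes[server][cid] for cid, server in entry["chunks"])


proxy = ToyProxy()
data = bytes(range(256)) * 40  # 10240 bytes
proxy.upload("example.bin", io.BytesIO(data), chunk_size=4096)
print(len(proxy.metadata["example.bin"]["chunks"]))  # 3
print(proxy.messages)                                # 14
print(proxy.download("example.bin") == data)         # True
```

Because the metadata entry lists each chunk with its location, a reader only needs to touch the servers that hold the chunks it wants, which is what makes spreading reads and repairing only the missing chunks possible.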