| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ddorian43 1593 days ago

> independent components like the filer, the volume manager, the master, etc.

You can run volume/master/filer in a single server (single command).

> filer probably needs an external rdbms to handle the metadata

This is true. You can use an external db. Or build/embed some other db inside it (think a distributed kv in golang that you embed inside to host the metadata).

> It also seems that a bucket must be pinned to a volume server on SeaweedFS.

This is not true. A bucket will be using it's own volumes, but can be and is distributed on the whole cluster by default.

> They use Raft, I suppose either by running an healthcheck every second which lead to data loss on a crash, or running for each transaction, which creates a huge bottleneck.

Raft is for synchronized writes. It's slow in the case of a single-write being slow because you have to wait for an "ok" from replicas, which is a good thing (compared to async-replication in, say, cassandra/dynamodb). Keep in mind that s3 also moved to synced replication. This is fixed by having more parallelism.

> Better scalability: because there is no special node, there is no bottlenecks. I suppose that SeaweedFS, all the requests have to pass through the master. We do not have such limitations.

Going to the master is only needed for writes, to get a unique id. This can be easily fixed with a plugin to say, generate twitter-snowflake-ids which are very efficient. For reads, you keep a cache in your client for the volume-to-server mapping so you can do reads directly from the server that has the data, or you can randomly query a server and it will handle everything underneath.

I'm pretty sure seaweedfs has very good fundamentals from researching all other open-source distributed object storage systems that exists.

1 comments

lxpz 1593 days ago

> Raft is for synchronized writes. It's slow in the case of a single-write being slow because you have to wait for an "ok" from replicas, which is a good thing (compared to async-replication in, say, cassandra/dynamodb). Keep in mind that s3 also moved to synced replication. This is fixed by having more parallelism.

We have synchronous writes without Raft, meaning we are both much faster and still strongly consistent (in the sense of read-after-write consistency, not linearizability). This is all thanks to CRDTs.

link

ddorian43 1593 days ago

> This is all thanks to CRDTs.

If you don't sync immediately, you may lose the node without it replicating yet and losing the data forever. There's no fancy algorithm when the machine gets destroyed before it replicated the data. And you can't write to 2 replicas simultaneously from the client like, say, when using a Cassandra-smart-driver since S3 doesn't support that.

CRDTs are nice but not magic.

link

superboum 1593 days ago

So let's take the example of a 9-nodes clusters with a 100ms RTT over the network to understand. In this specific (yet a little bit artificial) situation, Garage particularly shines compared to Minio or SeaweedFS (or any Raft-based object store) while providing the same consistency properties.

For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+ 100ms, 2 messages). Then, the leader will forward in parallel this write to the 9 nodes of the cluster and wait that a majority answers (+ 100ms, 18 messages). Then the leader will confirm the write to all the cluster and wait for a majority again (+ 100ms, 18 messages). Finally, it will answer to your gateway (already counted in the first step). In the end, our write took 300ms and generated 38 messages over the cluster.

Another critical point with Raft is that your writes do not scale: they all have to go through your leader. So on the writes point of view, it is not very different from having a single server.

For a DynamoDB-like object store (Riak CS, Pithos, Openstack Swift, Garage), the gateway receives the request and know directly on which nodes it must store the writes. For Garage, we choose to store every writes on 3 different nodes. So the gateway sends the write request to the 3 nodes and waits that at least 2 nodes confirm the write (+ 100ms, 6 messages). In the end, our write took 100ms, generated 6 messages over the cluster, and the number of writes is not dependent on the number of (raft) nodes in the cluster.

With this model, we can still provide always up to date values. When performing a read request, we also query the 3 nodes that must contain the data and wait for 2 of them. Because we have 3 nodes, wrote at least on 2 of them, and read on 2 of them, we will necessarily get the last value. This algorithm is discussed in Amazon's DynamoDB paper[0].

I reasoned in a model where there is no bandwidth, no CPU limit, no contention at all. In real systems, these limits apply, and we think that's another argument in favor of Garage :-)

[0]: https://dl.acm.org/doi/abs/10.1145/1323293.1294281

link

ddorian43 1593 days ago

> For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+ 100ms, 2 messages). Then, the leader will forward in parallel this write to the 9 nodes of the cluster and wait that a majority answers (+ 100ms, 18 messages). Then the leader will confirm the write to all the cluster and wait for a majority again (+ 100ms, 18 messages). Finally, it will answer to your gateway (already counted in the first step). Our write took 300ms and generated 38 messages over the cluster.

No. The "proxy" node, a random node that you connect to will do:

0. split the file into chunks of ~4MB (can be changed) while streaming

for each chunk (you can write chunks in parallel):

1. get id from master (can be fixed by generating an id in the proxy node with some custom code, 0 messages with custom plugin)

2. write to 1 volume-server (which will write to another node for majority) (2 messages)

3. update metadata layer, to keep track of chunks so you can resume/cancel/clean-failed uploads (metadata may be another raft subsystem, think yugabytedb/cockroachdb, so it needs to do it's own 2 writes) (2 messages)

Mark as "ok" in metadata layer and return ok to client. (2 messages)

The chunking is more complex, you have to track more data, but in the end is better. You spread a file to multiple servers & disks. If a server fails with erasure-coding and you need to read a file, you won't have to "erasure-decode" the whole file since you'll have to do it only for the missing chunks. If you have a hot file, you can spread reads on many machines/disks. You can upload very-big-files (terabytes), you can "append" to a file. You can have a smart-client (or colocate a proxy on your client server) for smart-routing and stuff.

link

ricardobeat 1593 days ago

If you're still talking about SeaweedFS, the answer seems to be that it's not a "raft-based object store", hence it's not as chatty as the parent comment described.

That proxy node is a volume server itself, and uses simple replication to mirror its volume on another server. Raft consensus is not used for the writes. Upon replication failure, the data becomes read-only [1], thus giving up partition tolerance. These are not really comparable.

[1] https://github.com/chrislusf/seaweedfs/wiki/Replication

link

vlovich123 1593 days ago

How does step 1 work? My understanding is that the ID from the master tells you which volume server to write to. If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?

link

ddorian43 1592 days ago

> If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?

You just need a unique id (which you generate it locally). And you need an writable volume-id, which you can query the master, master-follower, cache it, or query a volume-server directly.

link

chrislusf 1593 days ago

In snowflake id generation mode, the "which volume is writable" information can be read from other follower masters.

link

lxpz 1593 days ago

We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation. We are using CRDTs not so much for their asynchronous replication properties (what is usually touted as eventual consistency), but more as a way to avoid conflicts between concurrent operations so that we don't need a consensus algorithm like Raft. By combining this with the quorum system (two writes out of three need to be successfull before returning ok), we ensure durability of written data but without having to pay the synchronization penalty of Raft.

link

Ne02ptzero 1593 days ago

> We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation [...] we ensure durability of written data but without having to pay the synchronization penalty of Raft.

This is, in essence, strongly-consistent replication; in the sense that you wait for a majority of writes before answering a request: So you're still paying the latency cost of a round trip with a least another node on each write. How is this any better than a Raft cluster with the same behavior? (N/2+1 write consistency)

link

lxpz 1593 days ago

Raft consensus apparently needs more round-trips than that (maybe two round-trips to another node per write?), as evidenced by this benchmark we made against Minio:

https://garagehq.deuxfleurs.fr/documentation/design/benchmar...

Yes we do round-trips to other nodes, but we do much fewer of them to ensure the same level of consistency.

This is to be expected from a distributed system's theory perspective, as consensus (or total order) is a much harder problem to solve than what we are doing.

We haven't (yet) gone into dissecating the Raft protocol or Minio's implementation to figure out why exactly it is much slower, but the benchmark I linked above is already strong enough evidence for us.

link

nh2 1593 days ago

I think it would be great if you could make a Github repo that is just about summarising performance characteristics and roundrip types of different storage systems.

You would invite Minio/Ceph/SeaweedFS/etc. authors to make pull requests in there to get their numbers right and explanations added.

This way, you could learn a lot about how other systems work, and users would have an easier time choosing the right system for their problems.

Currently, one only gets detailed comparisons from HN discussions, which arguably aren't a great place for reference and easily get outdated.

link

withinboredom 1593 days ago

RAFT needs a lot more round trips that that, it needs to send a message about the transaction, the nodes need to confirm that they received it, then the leader needs to send back that it was committed (no response required). This is largely implementation specific (etcd does more round trips than that IIRC), but that's the bare minimum.

link