> independent components like the filer, the volume manager, the master, etc.

You can run volume/master/filer in a single server (single command).

> filer probably needs an external rdbms to handle the metadata

This is true. You can use an external DB, or build/embed some other DB inside it (think of a distributed KV store written in Go that you embed to host the metadata).

> It also seems that a bucket must be pinned to a volume server on SeaweedFS.

This is not true. A bucket uses its own volumes, but those volumes can be, and by default are, distributed across the whole cluster.

> They use Raft, I suppose either by running a health check every second, which leads to data loss on a crash, or running for each transaction, which creates a huge bottleneck.

Raft is for synchronized writes. A single write is slow because you have to wait for an "ok" from the replicas, which is a good thing (compared to async replication in, say, Cassandra/DynamoDB). Keep in mind that S3 also moved to synced replication. The latency cost is offset by having more parallelism.

> Better scalability: because there is no special node, there are no bottlenecks. I suppose that in SeaweedFS, all the requests have to pass through the master. We do not have such limitations.

Going to the master is only needed for writes, to get a unique ID. This can easily be fixed with a plugin that, say, generates Twitter Snowflake IDs, which are very efficient. For reads, you keep a cache of the volume-to-server mapping in your client so you can read directly from the server that has the data, or you can query a random server and it will handle everything underneath.

From researching all the other open-source distributed object storage systems that exist, I'm pretty sure SeaweedFS has very good fundamentals.
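To make the Snowflake-ID suggestion concrete, here is a minimal sketch of how a node could mint unique IDs locally instead of asking a master. This is not SeaweedFS's plugin API or its actual file-ID format. The class name `SnowflakeGenerator`, the 41/10/12 bit split, and the epoch constant are assumptions taken from the commonly described Snowflake layout.

```python
import threading
import time

# Assumed layout: 41 bits of milliseconds since a custom epoch,
# 10 bits of node id, 12 bits of per-millisecond sequence.
EPOCH_MS = 1_288_834_974_657  # any fixed past instant works if all nodes agree
NODE_BITS = 10
SEQ_BITS = 12
MAX_NODE = (1 << NODE_BITS) - 1
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical per-node ID generator; needs no coordination at runtime."""

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id <= MAX_NODE:
            raise ValueError(f"node_id must be in [0, {MAX_NODE}]")
        self.node_id = node_id
        self._clock = clock  # injectable so the generator can be tested
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp so ids
                # from this node never go down.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: spin until
                    # the clock advances.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (NODE_BITS + SEQ_BITS))
                | (self.node_id << SEQ_BITS)
                | self._seq
            )
```

Each node only needs a distinct `node_id` assigned once (for example at startup), after which `SnowflakeGenerator(node_id=7).next_id()` returns IDs that are unique across nodes and increasing on each node, with no round trip on the write path.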
We have synchronous writes without Raft, meaning we are both much faster and still strongly consistent (in the sense of read-after-write consistency, not linearizability). This is all thanks to CRDTs.
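The reply does not spell out how CRDTs and synchronous writes combine to give read-after-write consistency, so the following is only an illustrative sketch under stated assumptions, not a description of the actual implementation. It assumes a last-writer-wins register as the CRDT and majority quorums (2 of 3 replicas) for both writes and reads. The names `LWWRegister`, `write`, and `read` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """Last-writer-wins register: a simple state-based CRDT (assumed here)."""

    timestamp: int = 0
    node_id: str = ""
    value: object = None

    def merge(self, other):
        # Keep the entry with the larger (timestamp, node_id) pair. This
        # merge is commutative, associative and idempotent, so replicas
        # converge regardless of the order in which they see updates.
        return max(self, other, key=lambda r: (r.timestamp, r.node_id))


def write(replicas, targets, new):
    """Synchronous write: returns once every replica in `targets` has merged."""
    for i in targets:
        replicas[i] = replicas[i].merge(new)


def read(replicas, targets):
    """Quorum read: merge the states of the contacted replicas."""
    result = LWWRegister()
    for i in targets:
        result = result.merge(replicas[i])
    return result
```

With three replicas, any two write targets and any two read targets share at least one replica, so a read issued after a write has been acknowledged always merges in that write. No leader or replicated log is involved, but concurrent writers are resolved by the merge rule rather than by a global order, which is why this gives read-after-write consistency and not linearizability.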