Hacker News new | ask | show | jobs
by adrn10 1595 days ago
So basically, we are a hosting association that wanted to put their servers at home _and_ sleep at night. This demanded inter-home redundancy of our data. But none of the existing solutions (MinIO, Ceph...) are designed for high inter-node latency. Hence Garage! Your cheap and proficient object store: designed to run on a toaster, through carrier-pigeon-grade networks, while still supporting a tons of workloads (static websites, backups, Nextcloud... you name it)
3 comments

I've been researching a solution for this problem for some time and you all have landed on exactly how I thought this should be architected. Excellent work and I foresee much of the future infrastructure of the internet headed in this direction.
I love that attitude towards self-hosting and it not being the sole hobby from that day on. Casual self-hosting if you will.

House-keeping UX is key for self-hosting by laypersons.

Did you try seaweedfs? Or contributing to any of them?
(Garage Contributor here) We reviewed many of the existing solutions and none of them had the feature set we wanted. Compared to SeaweedFS, the main difference we introduce with Garage is that our nodes are not specialized, which lead to the following benefits:

- Garage is easier to deploy and to operate: you don't have to manage independent components like the filer, the volume manager, the master, etc. It also seems that a bucket must be pinned to a volume server on SeaweedFS. In Garage, all buckets are spread on the whole cluster. So you do not have to worry that your bucket fills one of your volume server.

- Garage works better in presence of crashes: I would be very interested by a deep analysis of Seaweed "automatic master failover". They use Raft, I suppose either by running an healthcheck every second which lead to data loss on a crash, or sending a request for each transaction, which creates a huge bottleneck in their design.

- Better scalability: because there is no special node, there is no bottlenecks. I suppose that with SeaweedFS, all the requests have to pass through the master. We do not have such limitations.

As a conclusion, we choose a radically different design with Garage. We plan to do a more in-depth comparison in the future, but even today, I can say that if we implement the same API, our radically different designs lead to radically different properties and trade-off.

> independent components like the filer, the volume manager, the master, etc.

You can run volume/master/filer in a single server (single command).

> filer probably needs an external rdbms to handle the metadata

This is true. You can use an external db. Or build/embed some other db inside it (think a distributed kv in golang that you embed inside to host the metadata).

> It also seems that a bucket must be pinned to a volume server on SeaweedFS.

This is not true. A bucket will be using it's own volumes, but can be and is distributed on the whole cluster by default.

> They use Raft, I suppose either by running an healthcheck every second which lead to data loss on a crash, or running for each transaction, which creates a huge bottleneck.

Raft is for synchronized writes. It's slow in the case of a single-write being slow because you have to wait for an "ok" from replicas, which is a good thing (compared to async-replication in, say, cassandra/dynamodb). Keep in mind that s3 also moved to synced replication. This is fixed by having more parallelism.

> Better scalability: because there is no special node, there is no bottlenecks. I suppose that SeaweedFS, all the requests have to pass through the master. We do not have such limitations.

Going to the master is only needed for writes, to get a unique id. This can be easily fixed with a plugin to say, generate twitter-snowflake-ids which are very efficient. For reads, you keep a cache in your client for the volume-to-server mapping so you can do reads directly from the server that has the data, or you can randomly query a server and it will handle everything underneath.

I'm pretty sure seaweedfs has very good fundamentals from researching all other open-source distributed object storage systems that exists.

> Raft is for synchronized writes. It's slow in the case of a single-write being slow because you have to wait for an "ok" from replicas, which is a good thing (compared to async-replication in, say, cassandra/dynamodb). Keep in mind that s3 also moved to synced replication. This is fixed by having more parallelism.

We have synchronous writes without Raft, meaning we are both much faster and still strongly consistent (in the sense of read-after-write consistency, not linearizability). This is all thanks to CRDTs.

> This is all thanks to CRDTs.

If you don't sync immediately, you may lose the node without it replicating yet and losing the data forever. There's no fancy algorithm when the machine gets destroyed before it replicated the data. And you can't write to 2 replicas simultaneously from the client like, say, when using a Cassandra-smart-driver since S3 doesn't support that.

CRDTs are nice but not magic.

So let's take the example of a 9-nodes clusters with a 100ms RTT over the network to understand. In this specific (yet a little bit artificial) situation, Garage particularly shines compared to Minio or SeaweedFS (or any Raft-based object store) while providing the same consistency properties.

For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+ 100ms, 2 messages). Then, the leader will forward in parallel this write to the 9 nodes of the cluster and wait that a majority answers (+ 100ms, 18 messages). Then the leader will confirm the write to all the cluster and wait for a majority again (+ 100ms, 18 messages). Finally, it will answer to your gateway (already counted in the first step). In the end, our write took 300ms and generated 38 messages over the cluster.

Another critical point with Raft is that your writes do not scale: they all have to go through your leader. So on the writes point of view, it is not very different from having a single server.

For a DynamoDB-like object store (Riak CS, Pithos, Openstack Swift, Garage), the gateway receives the request and know directly on which nodes it must store the writes. For Garage, we choose to store every writes on 3 different nodes. So the gateway sends the write request to the 3 nodes and waits that at least 2 nodes confirm the write (+ 100ms, 6 messages). In the end, our write took 100ms, generated 6 messages over the cluster, and the number of writes is not dependent on the number of (raft) nodes in the cluster.

With this model, we can still provide always up to date values. When performing a read request, we also query the 3 nodes that must contain the data and wait for 2 of them. Because we have 3 nodes, wrote at least on 2 of them, and read on 2 of them, we will necessarily get the last value. This algorithm is discussed in Amazon's DynamoDB paper[0].

I reasoned in a model where there is no bandwidth, no CPU limit, no contention at all. In real systems, these limits apply, and we think that's another argument in favor of Garage :-)

[0]: https://dl.acm.org/doi/abs/10.1145/1323293.1294281

We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation. We are using CRDTs not so much for their asynchronous replication properties (what is usually touted as eventual consistency), but more as a way to avoid conflicts between concurrent operations so that we don't need a consensus algorithm like Raft. By combining this with the quorum system (two writes out of three need to be successfull before returning ok), we ensure durability of written data but without having to pay the synchronization penalty of Raft.
As someone who wants this technology to be useful to commercial and open-source entities GPL software is more compatible with the status quo and I don't want to fight the battle of the license and the technology.

I hope you reconsider AGPL3.

More-permissive-than-AGPLv3 for service-based software creates a potential (or, I'd argue, likely) route towards large-platform capture and resale of the value of those services without upstream contribution in return, followed at a later date by the turning-down and closure of the software products and communities behind them.

That's the strategy that the platforms have, I think - for rational (albeit potentially immoral) economic and competitive reasons. I'd like to hear if you disbelieve that and/or have other reasons to think that permissive licensing is superior long-term.

> In Garage, all buckets are spread on the whole cluster. So you do not have to worry that your bucket fills one of your volume server.

Are you saying I do not have to worry about ONE volume server getting full, but instead I can worry about ALL of them getting full at the same time?

Unless garage has a special mitigation against this, usually performance gets much worse in large clusters as the filesystem fills up. As files are added and deleted, it struggles to keep nodes exactly balanced and a growing percentage of nodes will be full and unavailable for new writes.

So in a high throughput system, you may notice a soft performance degradation before actually running out of space.

If you aren't performance sensitive or don't have high write throughput, you might not notice. This is definitely something you should be forecasting and alerting on so you can acquire and add capacity or delete old data.

If you really don't like the idea of everything falling over at the same time, you could use multiple clusters or come up with a quota system (e.g. disable workload X if it's using more than Y TiB).

Yes, in which case you just make your cluster bigger or delete some data. Seems like a reasonable compromise to me.

Garage has advanced functionnality to control how much data is stored on each node, and is also very flexible with respect to adding or removing nodes, making all of this very easy to do.

> Garage has advanced functionnality to control how much data is stored on each node,

Are nodes assumed to all be the same or similar size?

We don't currently have a comparison with SeaweedFS. For the record, we have been developping Garage for almost two years now, and we hadn't heard of SeaweedFS at the time we started.

To me, the two key differentiators of Garage over its competitors are as follows:

- Garage contains an evolved metadata system that is based on CRDTs and consistent hashing inspired by Dynamo, solidly grounded in distributed system's theory. This allows us to be very efficient as we don't use Raft or other consensus algorithms between nodes, and we also do not rely on an external service for metadata storage (Postgres, Cassandra, whatever) meaning we don't pay an additionnal communication penalty.

- Garage was designed from the start to be multi-datacenter aware, again helped by insights from distributed system's theory. In practice we explicitly chose against implementing erasure coding, instead we spread three full copies of data over different zones so that overall availability is maintained with no degradation in performance when one full zone goes down, and data locality is preserved at all locations for faster access (in the case of a system with three zones, our ideal deployment scenario).

Can you go into a bit more detail about how using CRDTs avoids the need for consensus-based replication?
Sure!

CRDTs are often presented as a way of building asynchronous replication system with eventual consistency. This means that when modifying a CRDT object, you just update your local copy, and then synchronize in background with other nodes.

However this is not the only way of using CRDTs. At there core, CRDTs are a way to resolve conflicts between different versions of an object without coordination: when all of the different states of the object that exist in the network become known to a node, it applies a local procedure known as a merge that produces a deterministic outcome: all nodes do this, and once they have all done it, they are all in the same state. In that way, nodes do not need to coordinate before-hand when doing modifictations, in the sense where they do not need to run a two-phase commit protocol that ensures that operations are applied one after the other in a specific order that is replicated identically at all nodes. (This is the problem of consensus which is theoretically much harder to solve from a distributed system's theory perspective, as well as from an implementation perspective).

In Garage, we have a bit of a special way of using CRDTs. To simplify a bit, each file stored in Garage is a CRDT that is replicated on three known nodes (deterministically decided). While these three nodes could synchronize in the background when an update is made, this would mean two things that we don't want: 1/ when a write is made, it would be written only on one node, so if that node crashes before it had a chance to synchronize with other nodes, data would be lost; 2/ reading from a node wouldn't necessarily ensure that you have the last version of the data, therefore the system is not read-after-write consistent. To fix this, we add a simple synchronization system based on read/write quorums to our CRDT system. More precisely, when updating a CRDT, we wait for the value to be known to at least two of the three responsible nodes before returning OK, which allows us to tolerate one node failure while always ensuring durability of stored data. Further when performing a read, we ask for their current state of the CRDT to at least two of the three nodes: this ensures that at least one of the two will know about the last version that was written (due to the intersection of the quorums being non-empty), making the system read-after-write consistent. These are basically the same principles that are applied in CRDT databases such as Riak.

Sounds interesting. How do you handle the case where you're unable to send the update to other nodes?

So an update goes to node A, but not to B and C. Meanwhile, the connection to the client may be disrupted, so the client doesn't know the fate of the update. If you're unlucky here, a subsequent read will ask B and C for data, but the newest data is actually on A. Right?

I assume there's some kind of async replication between the nodes to ensure that B and C eventually catch up, but you do have an inconsistency there.

You also say there is no async replication, but surely there must be some, since by definition there is a quorum, and updates aren't hitting all of the nodes.

I understand that CRDTs make it easier to order updates, which solves part of consistent replication, but you still need a consistent view of the data, which is something Paxos, Raft, etc. solve, but CRDTs separated across multiple nodes don't automatically give you that, unless I am missing something. You need more than one node in order to figure out what the newest version is, assuming the client needs perfect consistency.

True; we don't solve this. If there is a fault and data is stored only on node A and not nodes B and C, the view might be inconsistent until the next repair procedure is run (which happens on each read operation if an inconsistency is detected, and also regularly on the whole dataset using an anti-entropy procedure). However if that happens, the client that sends an update will not have received an OK response, so it will know that its update is in an indeterminate state. The only guarantee that Garage gives you is that if you have an OK, the update will be visible in all subsequent reads (read-after-write consistency).
I'll also admit to having difficulty understanding how is all this distinct from non-CRDT replication mechanisms. Great mission and work by DeuxFluers team btw. Bonne chance!

Edit: Blogs on Riak's use of CRDTs:

http://christophermeiklejohn.com/erlang/lasp/2019/03/08/mono...

https://web.archive.org/web/20170801114916/http://docs.basho...

https://naveennegi.medium.com/rendezvous-with-riak-crdts-par...)

> I'll also admit to having difficulty understanding how is all this distinct from non-CRDT replication mechanisms.

This is because "CRDT" is not about a new or different approach to replication, although for some reason this has become a widely held perception. CRDT is about a new approach to the _analysis_ of replication mechanisms, using order theory. If you read the original CRDT paper(s) you'll find old-school mechanisms like Lamport Clocks.

So when someone says "we're using a CRDT" this can be translated as: "we're using an eventually consistent replication mechanism proven to converge using techniques from the CRDT paper".

Thanks!

I'll give a bit more details about CRDTs, then.

The thing is, if you don't have CRDT, the only way to replicate things over nodes in such a way that they end up in consistent states, is to have a way of ordering operations so that all nodes apply them in the same order, which is costly.

Let me give you a small example. Suppose we have a very simple key-value storage, and that two clients are writing different values at the same time on the same key. The first one will invoke write(k, v1), and the second one will invoke write(k, v2), where v1 and v2 are different values.

If all nodes receive the two write operation but don't have a way to know which one came first, some will receive v1 before v2, and end up with value v2 as the last written values, and other nodes will receive v2 before v1 meaning the will keep v1 as the definitive value. The system is now in an inconsistent state.

There are several ways to avoid this.

The first one is Raft consensus: all write operations will go through a specific node of the system, the leader, which is responsible for putting them in order and informing everyone of the order it selected for the operations. This adds a cost of talking to the leader at each operation, as well as a cost of simply selecting which node is the leader node.

CRDT are another way to ensure that we have a consistent result after applying the two writes, not by having a leader that puts everything in a certain order, but by embedding certain metadata with the write operation itself, which is enough to disambiguate between the two writes in a consistent fashion.

In our example, now, the node that does write(k, v1) will for instance generate a timestamp ts1 that corresponded to the (approximate) time at which v1 was written, and it will also generate a UUID id1. Similarly, the nodes that does write(k, v2) will generate ts2 and id2. Now when they send their new values to other nodes in the network, they will send along their values of ts1, id1, ts2 and id2. Nodes now know enough to always deterministcally select either v1 or v2, consistently everywhere: if ts1 > ts2 or ts1 = ts2 and id1 > id2, then v1 is selected, otherwise v2 is selected (we suppose that id1 = id2 has a negligible probability of happening). In terms of message round-trips between nodes, we see that with this new method, nodes simply communicate once with all other nodes in the network, which is much faster than having to pass through a leader.

Here the example uses timestamps as a way to disambiguate between v1 and v2, but CRDTs are more general and other ways of handling concurrent updates can be devised. The core property is that concurrent operations are combined a-posteriori using a deterministic rule once they reach the storage nodes, and not a-priori by putting them in a certain order.

Basically, our association is trying to move away from file systems, for scalability and availability purposes. There's a similarity between RDBMS (say SQL) and filesystems, in the sense that they distribute like crap. Main reason being that they have too strong consistency properties. Although, one can use Garage as a file system, notably using WebDAV. It's a beta feature, called Bagage: https://git.deuxfleurs.fr/Deuxfleurs/bagage I'll let my knowledgeable fellows answer, if you'd like to hear about differences between seaweedfs and Garage in more detail :)
seaweedfs is an object storage at the core. In my (personal) case, I disqualify cause of missing erasure coding.
Garage says isn't a goal either.

https://garagehq.deuxfleurs.fr/documentation/design/goals/

"Storage optimizations: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication."

Maybe they added it since you looked or did something not match your needs? https://github.com/chrislusf/seaweedfs/wiki/Erasure-coding-f...
I disqualify Garage because of missing erasure-coding. I know seaweedfs has it.