Hacker News new | ask | show | jobs
by nonameiguess 209 days ago
It's possible I'm talking out of my ass and totally wrong because I'm basing this on principles, not benchmarking, but I'm pretty sure the problem is more etcd itself than boltdb. Specifically, the Raft protocol requires that the cluster leader's log has to be replicated to a quorum of voting members, who need to write to disk, including a flush, and then respond to the leader, before a write is considered committed. That's floor(n/2) + 1 disk flushes and twice as many network roundtrips to write any value. When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle, it's hard for that not to become a bottleneck. Other limitations include the 8GiB disk limit another comment mentions and etcd's hard-coded 1.5 MiB request size limit that prevents you from writing large object collections in a single bundle.

etcd is fine for what it is, but that's a system meant to be reliable and simple to implement. Those are important qualities, but it wasn't built for scale or for speed. Ironically, etcd recommends 5 as the ideal number of cluster members and 7 as a maximum based on Google's findings from running chubby, that between-member latency gets too big otherwise. With 5, that means you can't ever store more than 40GiB of data. I have no idea what a typical ratio of cluster nodes to total data is, but that only gives you about 307MiB per node for 130,000 nodes, which doesn't seem like very much.

There are other options. k3s made kine which acts as a shim intercepting the etcd API calls made by the apiserver and translating it into calls to some other dbms. Originally, this was to make a really small Kubernetes that used an embedded sqlite as its datastore, but you could do the same thing for any arbitrary backend by just changing one side of the shim.

1 comments

I run several clusters a bit over 10k nodes and the etcd db size is about 30-50GiB depending on how long ago defragmentation was run.

It is kindof sad as these nodes are running around 2k IOPS to the disk and are mostly sitting idle at the hardware level, but etcd still regularly chokes.

I did look into kine in the past, but I have no idea if it is suitable for running a high performance data store.

> When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle

The trick is you deploy your k8s clusters in multiple datacenters in the same region (think AZs in AWS term). The control plane can span multiple AZs which are in separate buildings, but close in geography. From the setups I work on the latency between datacenters in the same region is only about 500 microseconds.